A New Real-Time FPGA-Based Implementation of K-Means Clustering for Images
Deng, T., Crookes, D., Siddiqui, F., & Woods, R. (2018). A New Real-Time FPGA-Based Implementation of K-Means Clustering for Images. In Z. Yang, D. Yang, K. Li, M. Fei, & D. Du (Eds.), Intelligent Computing and Internet of Things: First International Conference on Intelligent Manufacturing and Internet of Things and 5th International Conference on Computing for Sustainable Energy and Environment, Chongqing, China, September 21-23, 2018 (Vol. 924, pp. 468-477). (Communications in Computer and Information Science; Vol. 924). Springer-Verlag. https://doi.org/10.1007/978-981-13-2384-3_44
Published in: Intelligent Computing and Internet of Things
Document Version: Peer reviewed version
Queen's University Belfast - Research Portal: Link to publication record in Queen's University Belfast Research Portal
Publisher rights: Copyright 2018 Springer. This work is made available online in accordance with the publisher’s policies. Please refer to any applicable terms of use of the publisher.
A New Real-time FPGA-based Implementation of K-means Clustering for Images
Tiantai Deng, Danny Crookes, Fahad Siddiqui and Roger Woods
Queen’s University Belfast, University Road, Belfast, UK
{tdeng01, d.crookes, f.siddiqui, r.woods}@qub.ac.uk
Abstract. As an unsupervised machine learning algorithm, K-means clustering for
images has been widely used in image segmentation. The standard Lloyd’s algorithm
iteratively allocates all image pixels to clusters until convergence. The processing
requirement can be a problem for high-resolution images and/or real-time systems. In this
paper, we present a new histogram-based algorithm for K-means clustering, and its
FPGA implementation. Once the histogram has been constructed, the algorithm is O(GL)
for each iteration, where GL is the number of grey levels.
On a Xilinx ZedBoard, our algorithm achieves 140 FPS (640×480 images, running at
150 MHz, 4 clusters, 25 iterations), including final image reconstruction. At 100 MHz it
achieves 95 FPS. It is 7.6 times faster than the standard Lloyd’s algorithm but uses only
approximately half of the resources, while giving the same results. The more iterations,
the bigger the speed-up: for 50 iterations, our algorithm is 10.2 times faster than
Lloyd’s approach. Thus in all of these cases our algorithm achieves real-time performance,
whereas Lloyd’s struggles to do so. The number of clusters (up to a user-defined limit)
and the initialization method (one of three) can be selected at runtime.
Keywords: Unsupervised Machine Learning, Data Processing, K-means
Clustering, FPGA Acceleration
1 Introduction
Clustering problems arise in many applications [1]. K-means clustering is an
important pre-processing step in applications such as medical image segmentation [2],
pattern recognition, machine learning, and bio-informatics [3-4]. K-means clustering
is the process of grouping elements of a data set into several subsets, based on
minimizing the total Euclidean distance between cluster elements and the cluster
means. It is an unsupervised algorithm which, applied to images, distinguishes
different objects or regions in the image. The standard algorithm for K-means
clustering is Lloyd’s algorithm, which iteratively adjusts the clusters until
convergence (see Section 2). The common implementation platform is a CPU, on which
Lloyd’s algorithm can be coded efficiently.
However, when real-time processing is required, for videos or for images with very
high resolution, achieving real-time performance with the standard Lloyd’s algorithm
becomes a problem, because the complete image is processed at every iteration [5].
Thus, an alternative way of implementing K-means clustering is needed for real-time
processing. The standard Lloyd’s algorithm is naturally parallel [6]. We can improve
the performance either by using different hardware (e.g. multiple CPUs, or a GPU, or
an FPGA), or by using a faster algorithm. In this paper, we propose a new algorithm
for implementing K-means clustering which achieves significant speed-up both by
algorithm innovation and by hardware implementation on an FPGA. The FPGA
architecture can achieve real time performance for video processing or for large
image resolutions. The algorithm is considerably faster than the standard Lloyd’s
algorithm. Indeed, the algorithm gives significant speed-up even on a CPU. The
main contributions of the paper are as follows:
1) A new histogram-based implementation of K-means clustering. Our method
gives the same result as the standard Lloyd’s algorithm for image data, but is much
more efficient.
2) An FPGA implementation on the Xilinx ZedBoard. Our method is 7.6 times
faster but uses 50% of the resources compared to the Lloyd’s FPGA implementation.
3) A comparison of two different FPGA implementations: our new Histogram-
based K-means method, and the standard Lloyd’s K-means method, plus a
performance comparison with three FPGA implementations from the literature.
2 Related Work
A. Standard Lloyd’s Algorithm for K-means Clustering
The K-means problem is to find K clusters such that the means μ_c of each cluster
Cl_c satisfy equation (1).
$$\underset{Cl}{\arg\min} \sum_{c=0}^{K-1} \sum_{p_x \in Cl_c} \lVert p_x - \mu_c \rVert^2 \qquad (0 \le c < K) \tag{1}$$
The standard Lloyd’s algorithm is an iterative algorithm to solve the K-
means problem. The data set needs to be processed several times until the minimum
distance over all clusters is achieved. There are two steps in Lloyd’s algorithm: The
first is to assign the elements of the data set to the cluster with the closest mean. The
second is to update the cluster means. These steps are defined by equations (2) and
(3), for iteration t. The typical pseudocode is shown in Figure 1.
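The Assignment Step:
$$Cl_c^{t} = \{\, p_x : \lVert p_x - \mu_c^{t} \rVert^2 \le \lVert p_x - \mu_j^{t} \rVert^2 \ \ \forall j,\ 0 \le j < K \,\} \tag{2}$$
The Update Step:
$$\mu_c^{t+1} = \frac{1}{\lvert Cl_c^{t} \rvert} \sum_{p_i \in Cl_c^{t}} p_i \tag{3}$$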
Since the algorithm needs to access the data set to calculate the means and
distances at each iteration, it is necessary to keep a copy of the original data set in
memory throughout the FPGA implementation.
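As an illustration of equations (2) and (3), the following Python sketch applies the two steps to the grey levels of an image. It is a minimal sketch rather than the FPGA design: the function name `lloyd_kmeans`, the NumPy dependency, the `max_iters` cap and the convergence test (total squared distance unchanged between two iterations) are assumptions made for the example. Note that every iteration touches the full pixel array, which is why the original data must stay in memory.

```python
import numpy as np


def lloyd_kmeans(image, initial_means, max_iters=50):
    """Lloyd's algorithm on pixel intensities; returns (labels, means)."""
    pixels = np.asarray(image, dtype=np.float64).ravel()
    means = np.asarray(initial_means, dtype=np.float64).copy()
    labels = np.zeros(pixels.shape, dtype=np.int64)
    prev_distance = None
    for _ in range(max_iters):
        # Assignment step (equation (2)): nearest mean for every pixel.
        dist = (pixels[:, None] - means[None, :]) ** 2
        labels = np.argmin(dist, axis=1)
        total_distance = dist[np.arange(pixels.size), labels].sum()
        # Update step (equation (3)): mean of each non-empty cluster.
        for c in range(means.size):
            members = pixels[labels == c]
            if members.size > 0:
                means[c] = members.mean()
        # Assumed convergence test: total distance no longer changes.
        if prev_distance is not None and total_distance == prev_distance:
            break
        prev_distance = total_distance
    return labels.reshape(np.shape(image)), means


labels, means = lloyd_kmeans([[0, 0, 10], [10, 200, 200]], [0.0, 100.0, 255.0])
```

In this small example the middle cluster never receives a pixel, so its mean keeps its initial value, mirroring the `size[c] ≠ 0` guard in the pseudocode.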
B. FPGA Implementation of K-means Clustering
A number of researchers have already studied the FPGA implementation of K-means
clustering. In [7], Lavenier feeds the image into the distance calculation units, and
indexes every pixel in another array. Due to limited resources on the FPGA, the
image is stored in off-chip memory, and must then be streamed multiple times. This
entails extra communication overheads between the FPGA and off-chip memory.
Estlick et al. implemented K-means in hardware using software/hardware co-design.
The distance calculation is computed in hardware and the rest of the
algorithm is completed on a CPU to avoid using a huge amount of hardware resources
[8]. In [4], Choi and So use three FPGAs which work together to deal with the large
amount of data, because of the limited resources on each FPGA. In [9], Hussain,
Benkrid et al. implemented a highly parameterized parallel system; their
implementation is not dedicated to image clustering but shows the potential of using
FPGAs to accelerate the clustering process. Saegusa, Maruyama et al. implemented
K-means on a Xilinx FPGA, using a KD-tree, but they are limited by the resources on
the FPGA and have to store the image in an off-chip SRAM. They report a
performance of 30 FPS on a 24-bit RGB image [10-11]. In [11], the approach
replaces the Euclidean distance with the multiplier-free Manhattan distance and reduces
the word length to save resources. The approaches above focus on achieving greater
efficiency by adjusting the word length, simplifying the distance calculation or using
a heterogeneous platform.

    Function Lloyd(inputImage, K)
        for all c ∈ 0..K-1 do mean[c] = initial value
        do
            totalDistance = 0
            for all c ∈ 0..K-1 do sum[c] = 0; size[c] = 0
            for all (x,y) ∈ domain(inputImage) do
                p = inputImage[x,y]
                find c ∈ 0..K-1 such that abs(p-mean[c]) is minimum
                totalDistance += (p-mean[c])²
                sum[c] += p
                size[c] += 1
            for all c ∈ 0..K-1 do
                if (size[c] ≠ 0) mean[c] = sum[c] / size[c]
        while (totalDistance has not converged)
        return result, means
    end
Figure 1. Pseudocode for the standard Lloyd’s algorithm.
Others try to reduce the amount of data. In [1], a filter algorithm is applied
before K-means to reduce the amount of data. Maruyama reduces the amount of data
scanned by doing partial K-means, whereby the algorithm is applied to 1/8, 1/4, and 1/2
sampled data. This simplifies the K-means algorithm to get better performance but
also introduces inaccuracy into the result [12].
Another way to speed up Lloyd’s algorithm is to reduce the number of iterations.
The performance of Lloyd’s is very dependent on the starting point. In [13], Khan and
Ahmad used a small amount of data to execute K-means to get a better estimate of the
initial mean for each cluster. Pena, Lozano, and Larranaga compared four classical
initialization methods for the K-means algorithm: Random, Forgy, MacQueen, and
Kaufman. The authors found that the Random and Kaufman initializations make the
algorithm more effective and less dependent on the initial value, and hence more
robust [14].
The Soft-processor approach is another way of using FPGAs. This retains the high
level programming approach, and hides much of the complexity of hardware from
developers. In [15], a multi-processor architecture is proposed for K-means based on
a network on chip, and using a shared memory to support communication between
processors.
To summarize the above approaches, there are three existing ways to speed up the
process of K-means clustering using an FPGA. One is to focus on the system
architecture, parallelizing the system and finding more effective ways to allocate
memory, such as in [7] [9] and [10]. The second way is to reduce the amount of data,
such as in [1] and [12]. The third way is to reduce the number of iterations by finding
better initial means for each cluster, as in [13] and [14].
We now present our novel approach to implementing K-means clustering, and then
present its implementation on a single FPGA chip, including camera and display
control.
3 Histogram-based K-means Clustering
We first note that the K-means problem for image processing is more constrained
than K-means for other kinds of data. The range of the input data is discrete, e.g. for
an 8-bit grayscale image, the grey levels are from 0 to 255.
Secondly, the clustering of pixels does not need to take the location of pixels into
account. Therefore we can represent the input image by the image histogram (for the
purposes of clustering). The image histogram (which does not change throughout the
iterative clustering process) is effectively a compressed version of the image. For any
grey level value g, all pixels with that value can be clustered at the same time, with a
single calculation.
Thirdly, we can represent each cluster c merely as a sub-range of the histogram,
from column B_c to column B_{c+1} − 1 inclusive, where B_i is the lower bound of cluster i.
Fourthly, given the means of each cluster c, the new bounds, which define the new
clusters, will simply be half way between the respective means. Figure 2 shows a
typical histogram and the representation of four clusters.
For an Nx×Ny image, and doing L iterations, the complexity of the standard Lloyd’s
algorithm is O(Nx×Ny×L). The complexity of our new histogram-based algorithm is
O(Nx×Ny+256×L) (assuming 256 grey levels), where the Nx×Ny component is for
calculating the histogram. For example, for a 512×512 image with L = 20, the
standard method will require around 5.2M steps, whereas our histogram-based
method will require around 0.267M steps, an improvement, in theory, of the order of
20x. In practice, there will be the additional overhead of image input and perhaps
reconstruction of the output image, and the number of iterations will be image-
dependent. These factors will typically have the effect of reducing the apparent
speed-up.
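The step counts quoted above can be checked with a few lines of Python. This is only a worked version of the arithmetic in the text; the function names `lloyd_steps` and `histogram_steps` are hypothetical and chosen for the example.

```python
def lloyd_steps(nx, ny, iterations):
    """Every iteration visits every pixel."""
    return nx * ny * iterations


def histogram_steps(nx, ny, iterations, grey_levels=256):
    """One pass to build the histogram, then one pass over the grey levels per iteration."""
    return nx * ny + grey_levels * iterations


lloyd = lloyd_steps(512, 512, 20)      # 5,242,880 steps
hist = histogram_steps(512, 512, 20)   # 267,264 steps
ratio = round(lloyd / hist, 1)         # 19.6, i.e. of the order of 20x
```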
Figure 2. An image histogram and representation of four clusters (Cluster 1 to Cluster 4)
For an 8-bit image, the histogram is a one-dimensional vector with 256 elements,
where each element in the histogram represents the number of pixels having that grey
level. After generating the histogram, bounds are needed to divide the histogram into
different clusters. Each cluster is defined by a pair of bounds, which define the part of
the histogram corresponding to the cluster. If we have K clusters, there will be K+1
bounds, where B0 = 0, and BK = 256. Cluster c (0≤c<K) is the part of the histogram
from column Bc to B(c+1)-1 inclusive. The inner bounds (0 < c < K) can be initialized
in different ways (see later). Then, the means for the current clusters are calculated as
defined by equation (4).
$$\mu_c = \frac{\sum_{i=B_c}^{B_{c+1}-1} H(i)\, i}{\sum_{i=B_c}^{B_{c+1}-1} H(i)} \qquad (0 \le c < K) \tag{4}$$
Once the updated means are calculated, the next iteration first assigns the clusters
by merely updating the bounds. To assign a pixel to the correct new cluster, we have
to see which mean it is closest to. Consider a pixel with value g (say, 150) in an
image with a histogram such as in Figure 2, and the means are [0.0, 27.5, 90.5, 165.0,
220.0, 256.0]. Column g will be part of the cluster whose mean is closest to g, which
here is 165.0. We can achieve this simply by moving the bounds so that each bound is
half way between the two respective means. The updating of the clusters is defined by
equation (5).
$$B_c = \frac{\mu_{c-1} + \mu_c}{2} \qquad (0 < c < K) \tag{5}$$
The iteration can then be completed by updating the means using equation (4).
We use integers to hold the bounds, so that they correspond to actual pixel grey
levels. In our FPGA implementation we use simple fixed point to represent the
means.
Termination of the iterative algorithm could be brought about under several
conditions. The total distance can be readily calculated as in the standard Lloyd’s
algorithm. However, in our preferred approach, the termination condition is that none
of the bounds are changed by the updating (equation (5)). This has the advantage that
the total (squared) distance does not need to be calculated. A third option is to set a
fixed number of iterations (say, 50), which can be sufficient for many applications in
practice. This number may be application-specific, and can be set at runtime.
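The iteration described in this section (means from equation (4), bounds from equation (5), termination when no bound changes) can be sketched in Python as follows. This is a software illustration under stated assumptions, not the HLS implementation: the name `histogram_kmeans`, the use of NumPy, the rounding rule that turns a midpoint into an integer bound (ties go to the lower cluster) and the treatment of an empty cluster (its mean is taken as the centre of its range) are all choices made for the example.

```python
import numpy as np


def histogram_kmeans(image, initial_bounds, grey_levels=256, max_iters=50):
    """Histogram-based K-means; returns (bounds, means).

    initial_bounds holds the K+1 integer bounds, with bounds[0] == 0 and
    bounds[K] == grey_levels; cluster c covers columns bounds[c]..bounds[c+1]-1.
    """
    # The image is only read once, to build the histogram.
    hist = np.bincount(np.asarray(image).ravel(), minlength=grey_levels)
    levels = np.arange(grey_levels)
    bounds = [int(b) for b in initial_bounds]
    k = len(bounds) - 1
    means = np.zeros(k)
    for _ in range(max_iters):
        # Equation (4): weighted mean of each histogram sub-range.
        for c in range(k):
            lo, hi = bounds[c], bounds[c + 1]
            count = hist[lo:hi].sum()
            if count > 0:
                means[c] = (hist[lo:hi] * levels[lo:hi]).sum() / count
            else:
                # Assumption: an empty cluster uses the centre of its range.
                means[c] = (lo + hi - 1) / 2.0
        # Equation (5): each inner bound moves half way between adjacent means.
        new_bounds = list(bounds)
        for c in range(1, k):
            midpoint = (means[c - 1] + means[c]) / 2.0
            # Assumption: a grey level exactly on the midpoint stays in the lower cluster.
            new_bounds[c] = int(np.floor(midpoint)) + 1
        # Preferred termination condition: no bound has changed.
        if new_bounds == bounds:
            break
        bounds = new_bounds
    return bounds, means


bounds, means = histogram_kmeans([[0, 0, 10], [10, 200, 200]], [0, 128, 256])
```

For this two-cluster example the means settle at 5 and 200, so the single inner bound moves to 103 and the loop stops on the second iteration.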
There are several ways to initialize the K-means algorithm such as Random, or
MacQueen and Kaufman, as mentioned in [14]. Actually, in our histogram method,
we initialize the cluster bounds rather than the means. We provide two particular
ways to do this. The first is simply to divide the grey level range into K equal parts.
The second way is to divide all the pixels into K equally sized clusters. The latter can
be achieved by using the cumulative histogram and setting the bounds to the columns
which have the appropriate height.
Dividing the histogram range into K equal parts takes the least resources in a
hardware implementation. However, in an image with an unevenly distributed
histogram, some clusters may be empty (for instance, if the image is overexposed or
too dark). The division by zero that would arise when calculating the mean of a
zero-element cluster can be avoided with a suitable check; however, if we have two
adjacent zero-element clusters, the bounds would not move and those clusters would
not grow. Thus, the safest way to initialize the clusters is to set each cluster
initially to have the same number of pixels (approximately), using the cumulative
histogram. We can build the cumulative histogram from the histogram during
initialization. A refinement which is easy to implement is for the user to supply an
initial, application-specific estimate of the relative sizes of each cluster as a
parameter.
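The two initialization strategies can be sketched as follows. The function names `equal_range_bounds` and `equal_population_bounds` are hypothetical, and the exact column chosen for each bound (the column after the first one whose cumulative height reaches c/K of the pixel count) is an assumption; the text only requires that the clusters start with approximately equal numbers of pixels.

```python
import numpy as np


def equal_range_bounds(k, grey_levels=256):
    """Divide the grey-level range into K equal parts."""
    return [c * grey_levels // k for c in range(k + 1)]


def equal_population_bounds(hist, k):
    """Return K+1 bounds giving each cluster roughly the same pixel count."""
    hist = np.asarray(hist)
    cumulative = np.cumsum(hist)
    total = int(cumulative[-1])
    bounds = [0]
    for c in range(1, k):
        target = c * total / k
        # First column whose cumulative height reaches the target; the next
        # cluster starts at the column after it.
        column = int(np.searchsorted(cumulative, target, side="left"))
        bounds.append(min(column + 1, hist.size))
    bounds.append(hist.size)
    return bounds


range_bounds = equal_range_bounds(4)
population_bounds = equal_population_bounds([4, 0, 0, 4, 0, 4, 0, 4], 4)
```

With the toy 8-column histogram above, each of the four clusters starts with exactly 4 pixels. For a strongly concentrated histogram two bounds can coincide, in which case a cluster still starts empty.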
4 System Implementation
We implemented the histogram-based K-means Clustering algorithm on a Xilinx
FPGA, using Vivado HLS on a ZedBoard and running at either 150 MHz or 100
MHz. And the architecture is shown in the figure 3 below.
The architecture of the FPGA implementation is simple. The system is made up of
two parts, the I/O part and the processing part. The whole system is controlled over
the AXI-Lite bus. The camera repeatedly captures a frame and stores it in the frame
buffer. Then the data is streamed to the input FIFO and fed into the K-means
hardware. After the image is processed, the bounds and means are sent to the image
reconstruction hardware and the output frame buffer. We use a physical VGA
interface for display. The K-means hardware and the image reconstruction hardware
are programmed in HLS. The input and output images are transferred using the
AXI-Stream interface. The means and bounds of each cluster are held in a BRAM.
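Because each cluster is a contiguous range of grey levels, the output image can be rebuilt from the bounds and means alone. The sketch below is one plausible software analogue of the reconstruction stage, not a description of the actual hardware: it assumes that each pixel is replaced by the rounded mean of its cluster, and the name `reconstruct` and the lookup-table approach are choices made for the example.

```python
import numpy as np


def reconstruct(image, bounds, means, grey_levels=256):
    """Map every pixel to the (rounded) mean of the cluster its grey level falls in."""
    lut = np.zeros(grey_levels, dtype=np.uint8)
    for c in range(len(bounds) - 1):
        # All grey levels in bounds[c]..bounds[c+1]-1 share one output value.
        lut[bounds[c]:bounds[c + 1]] = int(float(means[c]) + 0.5)
    # One table lookup per pixel, which suits a streaming implementation.
    return lut[np.asarray(image)]


output = reconstruct([[0, 0, 10], [10, 200, 200]], [0, 103, 256], [5.0, 200.0])
```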
For comparison purposes, we run both Lloyd’s and our K-means hardware at both
100 MHz and 150 MHz. The 100/150 MHz clock output by the ARM is divided down to
50 MHz and 25 MHz: 50 MHz is used for the camera and display, and 25 MHz is used for
the data transfer through the FIFOs and frame buffers. The FIFOs help synchronise
data across the clock domains between the frame buffer and our K-means hardware.
Figure 3. System Architecture
5 Results and Comparison
We compare the performance of two designs: the standard Lloyd’s algorithm and
our new histogram-based algorithm. For the hardware versions, we consider several
factors: hardware usage, the clock cycles required to process an image (4 clusters),
and the actual number of frames per second (FPS).
Table 1. Hardware Utilization (Including Reconstruction of Output Image)

Clock     Design      LUTs   FFs    DSPs   BRAMs
100 MHz   Histogram   1194    749   1      40.5
100 MHz   Lloyd's     2641   1535   3      38.5
150 MHz   Histogram   1807   2770   4      40.5
150 MHz   Lloyd's     3793   4350   3      38.5
The utilization of the different designs is shown in Table 1. It can be seen that the
histogram-based algorithm takes about half the LUT (look-up table) and FF (flip-flop)
resources of Lloyd’s.
Since our histogram-based architecture achieves real-time performance even at
100 MHz, the clustering itself does not demand the fastest clock speed; and since the
100 MHz version uses rather fewer hardware resources, it would normally be
preferred.
Table 2. Comparison with other work

100 MHz, L = 25    FPS    Iterations
Histogram (HW)     95     25
Lloyd’s (HW)       12.5   25
[13]               30     --
[14]               38.6   --
[12]               45.8   --

We also compare the performance of our algorithm to the implementations in [13-
15]. The work in [14] takes a 24-bit RGB image of size 640×480 and has a
performance of 30 FPS; the pixels in the RGB channels are clustered in parallel. The
work in [15] and [13] uses 640×480 grayscale images. [15] uses the Manhattan
distance in place of the Euclidean distance, and [13] reduces the data set by sampling
the image to get better performance. Those two approaches are based on the standard
Lloyd’s algorithm but cannot guarantee full accuracy, and are slower than
our new algorithm, as Table 2 shows.
Hence, the new histogram-based algorithm is the best choice when multiple clusters
are needed. Compared with the existing algorithms, it takes fewer resources, performs
better than full K-means, and is easy to adapt to a streaming implementation.
6 Conclusion and Future Work
This paper has presented a new K-means clustering algorithm, based on iteratively
processing the image histogram rather than the image. The main contributions are as
follows:
1. A new histogram-based K-means clustering algorithm which transforms the 2-
D image into a 1-D histogram. The histogram retains all the details which are
needed to perform K-means clustering. The method is equivalent to Lloyd’s,
and gives the same results. Even an implementation on a CPU can achieve
19 FPS, a speed-up of 9.1 over Lloyd’s.
2. An FPGA implementation of the new histogram-based K-means clustering
algorithm. On an FPGA, the histogram-based design takes approximately half the
resources of Lloyd’s and achieves a speed-up of approximately 7.6. Even at
100 MHz and with 50 iterations, we achieve 82 FPS for 640×480 images.
Future work will look at providing greater flexibility in the FPGA architecture,
including a wider range of cluster initialization strategies, and some possible further
optimizations.
7 Acknowledgement
This work was supported by the Chinese Scholarship Council.
References
[1] Kanungo T, Mount D M, Netanyahu N S, et al. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(7): 881-892.
[2] Ng H P, Ong S H, Foong K W C, et al. Medical image segmentation using k-means clustering and improved watershed algorithm. Proc. Image Analysis and Interpretation, 2006 IEEE Southwest Symposium on. IEEE, 2006: 61-65.
[3] Duda R O, Hart P E, Stork D G. Pattern classification and scene analysis, 2nd ed. Wiley Interscience, 1995.
[4] Choi Y M, So H K H. Map-reduce processing of k-means algorithm with FPGA-accelerated computer cluster. Proc. Application-specific Systems, Architectures and Processors (ASAP), 2014 IEEE 25th International Conference on. IEEE, 2014: 9-16.
[5] Neshatpour K, Koohi A, Farahmand F, et al. Big biomedical image processing hardware acceleration: A case study for K-means and image filtering. Proc. Circuits and Systems (ISCAS), 2016 IEEE International Symposium on. IEEE, 2016: 1134-1137.
[6] Farivar R, Rebolledo D, Chan E, et al. A parallel implementation of K-means clustering on GPUs. Proc. PDPTA. 2008, 13(2): 212-312.
[7] Lavenier D. FPGA implementation of the k-means clustering algorithm for hyperspectral images. Proc. Los Alamos National Laboratory LAUR. 2000.
[8] Estlick M, Leeser M, Theiler J, et al. Algorithmic transformations in the implementation of K-means clustering on reconfigurable hardware. Proc. Proceedings of the 2001 ACM/SIGDA ninth international symposium on Field programmable gate arrays. ACM, 2001: 103-110.
[9] Hussain H M, Benkrid K, Erdogan A T, et al. Highly parameterized K-means clustering on FPGAs: Comparative results with GPPs and GPUs. Proc. Reconfigurable Computing and FPGAs (ReConFig), 2011 International Conference on. IEEE, 2011: 475-480.
[10] Saegusa T, Maruyama T, Yamaguchi Y. How fast is an FPGA in image processing? Proc. Field Programmable Logic and Applications, 2008. FPL 2008. International Conference on. IEEE, 2008: 77-82.
[11] Saegusa T, Maruyama T. An FPGA implementation of k-means clustering for color images based on Kd-tree. Proc. Field Programmable Logic and Applications, 2006. FPL'06. International Conference on. IEEE, 2006: 1-6.
[12] Maruyama T. Real-time k-means clustering for color images on reconfigurable hardware. Proc. Pattern Recognition, 2006. ICPR 2006. 18th International Conference on. IEEE, 2006, 2: 816-819.
[13] Khan S S, Ahmad A. Cluster center initialization algorithm for K-means clustering. Pattern Recognition Letters, 2004, 25(11): 1293-1302.
[14] Pena J M, Lozano J A, Larranaga P. An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recognition Letters, 1999, 20(10): 1027-1040.
[15] Khawaja S G, Akram M U, Khan S A, et al. A novel multiprocessor architecture for k-means clustering algorithm based on network-on-chip. Proc. Multi-Topic Conference (INMIC), 2016 19th International. IEEE, 2016: 1-5.