FPGA-Based Simultaneous Localization and Mapping (SLAM) using High-Level Synthesis
Basile Van Hoorick
Student number: 01404852
Supervisors: Prof. dr. ir. Bart Goossens, Prof. dr. ir. Erik D'Hollander
Counsellors: Dr. ir. Jan Aelterman, Ir. Michiel Vlaminck
Master's dissertation submitted in order to obtain the academic degree of Master of Science in Electrical Engineering - main subject Communication and Information Technology
Academic year 2018-2019
Admission to Loan
The author gives his permission to make this master’s dissertation available for con-
sultation and to copy parts of this master’s dissertation for personal use. In all cases
of other use, the copyright terms have to be respected, in particular with regard to
the obligation to explicitly state the source when quoting results from this master’s
dissertation.
Basile Van Hoorick, May 2019
Acknowledgements
First and foremost, I would like to express my sincerest gratitude towards Prof. dr.
ir. Bart Goossens and Prof. em. dr. ir. Erik D’Hollander for giving me the opportu-
nity to conduct this master’s dissertation at the Department of Telecommunications
and Information Processing. I truly appreciate their vast expertise and would like
to thank them for their guidance towards making substantiated decisions, as well as
for their outstanding passion in their respective fields of expertise.
In particular, Prof. em. dr. ir. Erik D’Hollander of the Department of Elec-
tronics and Information Systems has been extremely helpful with regard to the so-
phisticated practicalities of testing heterogeneous computer systems. I have learned
an enormous amount about Field-Programmable Gate Arrays over the past ten
months, and I could not possibly have wished for a more driven and competent su-
pervisor than him.
I also want to thank Prof. dr. ir. Bart Goossens for offering his aid and exten-
sive knowledge regarding Simultaneous Localization and Mapping, as well as for
providing me with helpful suggestions and tips throughout the year. Furthermore,
I am grateful to Prof. dr. ir. Wilfried Philips, Prof. dr. ir. Peter Veelaert and other
researchers at the Image Processing and Interpretation group for their valuable feed-
back and advice given during the two intermediate thesis presentations.
Last but not least, I would like to thank my parents, family and friends for their
indispensable support and encouragement throughout the entire period of my stud-
ies. Distinct credit goes to Tinus Pannier, Clemens Schlegel, Jacques Van Damme
and Viktor Verstraelen, with whom I have shared many pleasant breaks and memo-
rable moments during this exceptionally busy year.
Basile Van Hoorick, May 2019
FPGA-Based Simultaneous Localization and Mapping using High-Level Synthesis
by
Basile VAN HOORICK
Master’s dissertation submitted in order to obtain the academic degree of
MASTER OF SCIENCE IN ELECTRICAL ENGINEERING
Academic year 2018-2019
Promoters: Prof. dr. ir. Bart GOOSSENS, Prof. em. dr. ir. Erik D’HOLLANDER
Supervisors: dr. ir. Jan AELTERMAN, ir. Michiel VLAMINCK
Faculty of Engineering and Architecture
Ghent University
Department of Telecommunications and Information Processing
Chairman: Prof. dr. ir. Joris WALRAEVENS
Abstract
SLAM continues to grow in popularity despite the lack of an embedded, low-power
yet real-time solution for dense 3D scene reconstruction. An attempt to fill this gap
with the Xilinx Zynq-7020 SoC resulted in the formation and evaluation of a de-
tailed methodology that tackles several types of typical routines in the image pro-
cessing domain using HLS. The devised principles and guidelines are then tested
by applying them to eight kernels of an established 3D SLAM application, reveal-
ing powerful potential and an estimated holistic speed-up of ×40.4 over execution
on the ARM Cortex-A9 CPU. Multi-modal, multi-resolution dataflow architectures
are subsequently proposed and compared with the purpose of efficiently mapping
algorithmic blocks and their interconnections to hardware while conforming to the
FPGA’s limitations. A trade-off between area and throughput appears to be the deciding factor, although further research is desired towards merging the two identified Pareto-optimal techniques.
Keywords
Simultaneous Localization and Mapping, Field-Programmable Gate Array, High-Level Synthesis, Image Processing, System-on-Chip
FPGA-Based Simultaneous Localization and
Mapping using High-Level Synthesis
Basile Van Hoorick
Supervisors: Prof. dr. ir. Bart Goossens, Prof. em. dr. ir. Erik D’Hollander
Abstract — SLAM continues to grow in popularity despite the
lack of an embedded, low-power yet real-time solution for dense
3D scene reconstruction. An attempt to fill this gap with the
Xilinx Zynq-7020 SoC resulted in the formation and evaluation of
a detailed methodology that tackles several types of typical
routines in the image processing domain using HLS. The devised
principles and guidelines are then tested by applying them to
eight kernels of an established 3D SLAM application, revealing
powerful potential and an estimated holistic speed-up of ×40.4
over execution on the ARM Cortex-A9 CPU. Multi-modal, multi-
resolution dataflow architectures are subsequently proposed and
compared with the purpose of efficiently mapping algorithmic
blocks and their interconnections to hardware while conforming
to the FPGA’s limitations. A trade-off between area and
throughput seems to be the deciding factor, although further
research is desired towards merging the two identified
Pareto-optimal techniques.
Keywords — Simultaneous Localization and Mapping, Field-
Programmable Gate Array, High-Level Synthesis, Image
Processing, System-on-Chip
I. INTRODUCTION
As we embark on the road towards a more autonomous
world, countless challenges and opportunities emerge in
various subdisciplines of computer architecture, algorithm
design and electronics. One such challenge is Simultaneous
Localization and Mapping (SLAM), which attempts to make a
robot aware of its surroundings. The goal of SLAM is to track
the position and orientation of an agent within an unknown
environment, while simultaneously constructing a model of
this very environment [1]. Dense SLAM variants distinguish
themselves from their sparse counterparts by incorporating as
much sensor data as possible into their global reconstruction.
However, their considerable advantage in the form of
producing a high-quality model that is reusable across
applications comes at the cost of far greater computational
complexity [2]. At the same time, embedded SLAM solutions
are in high demand due to their many use cases on mobile and
low-power devices such as autonomous vehicles [3].
In this master’s dissertation, a framework is presented by
which SLAM and, by extension, image processing kernels in
general can be mapped effectively onto Field-Programmable
Gate Arrays (FPGAs). The FPGA is a reconfigurable
integrated circuit that can reach high performance yet low
power consumption [4], offering a flexible platform on which
to evaluate the hardware implementation of a dense 3D SLAM
algorithm. High-Level Synthesis (HLS) tools are employed
for their ability to perform high-level, pragma-directed
compilation of C code into hardware [5]. The use
case of choice is KinectFusion, a prominent scene
reconstruction algorithm [6] that is representative of diverse
paradigms in both 2D and 3D image processing. The only
existing work in the literature that accelerates parts of
KinectFusion on an FPGA also uses a GPU [7], which is
avoided in this thesis due to its high energy consumption. We
also explore how multiple kernels with complex dataflow
characteristics can be combined in hardware so as to form an
efficient, large-scale pipeline consisting of functional blocks.
II. HLS DESIGN OF INDIVIDUAL KERNELS
A. Methodology
Every kernel under consideration can be categorized
according to one or multiple parallel patterns most closely
associated with its computational and/or data management
structure [8]. Techniques are developed to deal with the
following patterns in HLS:
• Map & Reduce: The independence of every input
(and output) pixel lends itself to the application of
pipelining and AXI streaming interfaces, enforcing the
single-read, single-write principle of every element in
the array while overlapping multiple instances of
similar calculations in time so as to enable efficient use of
DSPs and other hardware blocks.
• Stencil: In addition to the above, line buffers and
memory windows (see Figure 1) must be inserted in
order to fully exploit data reuse and preserve the I/O
streaming model [9][10]. Further speed-ups are
obtained by partitioning both arrays in certain
dimensions across multiple instances of local storage,
which prevents the internal block RAM from causing a
bottleneck due to the high number of concurrent data
accesses.
• Gather: Reads from irregular positions in large arrays
are more complicated to handle on an FPGA due to its
limited local memory size. As continuous DDR requests
to DRAM form significant bottlenecks in practice [11],
the use of scratchpads is recommended to cache
(portions of) the region of interest. Multiple re-
executions of the subroutine might be necessary to
adequately deal with all required data.
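The Map pattern above can be modelled in plain C++ as follows. This is an illustrative sketch, not code from the thesis: the function name and gain operation are hypothetical, and in Vivado HLS the loop would carry a `#pragma HLS pipeline II=1` directive while the vectors would become `hls::stream<>` AXI-Stream interfaces.

```cpp
#include <cassert>
#include <vector>

// Software model of a Map-pattern kernel under the single-read,
// single-write streaming discipline (illustrative sketch).
std::vector<int> map_scale(const std::vector<int>& in, int gain) {
    std::vector<int> out;
    out.reserve(in.size());
    for (int v : in) {           // each input element is read exactly once...
        out.push_back(v * gain); // ...and each output element written exactly once
    }
    return out;
}
```

Because every element is touched exactly once in order, the loop maps directly onto a pipelined datapath fed and drained by streams.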
Figure 1: Interaction between the line buffer and window for Stencil-type kernels, visualized on the input image (left) and as they are structured in memory (right).
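The line-buffer and window mechanism can be modelled in ordinary C++ as below. This is a minimal software sketch with illustrative names and a toy 3×3 sum stencil (not thesis code); on the FPGA, `lines` would map to partitioned block RAM and `win` to a fully partitioned register array so that all accesses happen in one clock cycle.

```cpp
#include <cassert>
#include <vector>

constexpr int W = 8;  // toy image width (illustrative)

// 3x3 sum stencil over a row-streamed image using two line buffers
// and a sliding 3x3 window, preserving one-pixel-per-iteration I/O.
std::vector<int> stencil_sum3x3(const std::vector<int>& img, int h) {
    std::vector<int> out(img.size(), 0);
    int lines[2][W] = {{0}};  // the two previous image rows
    int win[3][3]   = {{0}};  // 3x3 sliding window
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < W; ++x) {  // one pixel read per iteration
            int px = img[y * W + x];
            // shift the window left; fill its rightmost column
            for (int r = 0; r < 3; ++r)
                for (int c = 0; c < 2; ++c) win[r][c] = win[r][c + 1];
            win[0][2] = lines[0][x];
            win[1][2] = lines[1][x];
            win[2][2] = px;
            // rotate the line buffers for the next row
            lines[0][x] = lines[1][x];
            lines[1][x] = px;
            // emit a result once the window is fully valid
            if (y >= 2 && x >= 2) {
                int s = 0;
                for (int r = 0; r < 3; ++r)
                    for (int c = 0; c < 3; ++c) s += win[r][c];
                out[(y - 1) * W + (x - 1)] = s;
            }
        }
    }
    return out;
}
```

Each pixel is read from the input exactly once; the buffers supply the other eight window values, which is the data reuse that keeps the streaming model intact.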
An initiation interval (II) of one clock cycle is the goal in
the majority of cases, so that no further speed-up is possible
unless processing elements are duplicated. The choice
between fixed-point and floating-point data type
representations depends on the complexity and kind of
operations employed in each kernel, but the former usually
results in a more hardware-efficient design, despite the
possible overhead introduced by conversions.
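The fixed-point trade-off can be illustrated with a scaled-integer sketch. This is illustrative code, not from the thesis; Vivado HLS provides `ap_fixed<W,I>` for this purpose, mimicked here by a Q16.16 format in plain C++.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int FRAC_BITS = 16;  // Q16.16: 16 integer, 16 fractional bits

int32_t to_fixed(double v)    { return (int32_t)std::lround(v * (1 << FRAC_BITS)); }
double  from_fixed(int32_t f) { return (double)f / (1 << FRAC_BITS); }

// A fixed-point multiply maps to a cheap integer DSP operation on the
// FPGA, at the cost of precision beyond 2^-16 and an explicit rescale.
int32_t fx_mul(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a * b) >> FRAC_BITS);
}
```

The conversion functions model the overhead mentioned above: each boundary between floating-point and fixed-point domains costs an explicit scale operation.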
B. Implementation of KinectFusion
Eight SLAM kernels are examined and optimized in Vivado
HLS, leading to a median speed-up of ×30.5 by purely
applying the presented methodology over leaving the code
unchanged. Additional transformations, which require thorough
insight into the use case as well as statistical analysis of typical
values in various steps of the algorithm using real-world data,
lead to an additional median speed-up of ×2.45 and further
decreases in resource utilization. According to this evaluation,
the most significant performance gains clearly originate from
the discussed standard approaches, although it remains
important to incorporate application-specific knowledge as
well to avoid superfluous hardware usage and suboptimal
designs.
III. COMBINED ACCELERATION OF MULTIPLE KERNELS
A. Problem statement and initial configuration
The complex, multi-resolution nature of tracking is reflected
in its requirement of seven output streams from the preceding
stages of KinectFusion, shown in Figure 2. Traditional task-
level pipelining does not capture how stream duplication or
multi-modal paths should be handled. The dataflow can be
broken down into two more general challenges: one is the
accumulation of intermediate results down a pipelined path,
and the other concerns creating multi-modal blocks as to
maximize resource sharing across different functional paths.
Three distinct ways are proposed and compared in which both
of these issues can be resolved. The first one places all
accelerators independently on the FPGA, each with its own
AXI DMA, and all data is passed via DRAM. This sidesteps the
described difficulties entirely; however, better results are
expected once task-level pipelining between subsequent blocks
is employed.
Figure 2: Dataflow diagram of KinectFusion's first five kernels.
B. Block-level and HLS-level pipelined architectures
In the Vivado block design, collecting intermediate outputs
is done by redirecting the needed streams from between
multiple components directly back to the processing system
via an AXI DMA. Multiple modes can be activated either by
setting control signals via the AXI-Lite protocol, or by
inserting stream switching IP cores to enable the selection
among different blocks altogether.
The same principles can also be applied at the level of
Vivado HLS, albeit after taking special measures to reconcile
them with the HLS dataflow optimization directive. This
includes strict adherence to the single-producer, single-
consumer paradigm and the non-conditional execution of
blocks. Intermediate output aggregation is achieved by
programming virtual pass-through connections and having
each kernel attach its own output values to the increasingly
wide stream of interleaved data. Multi-modality of kernels is
translated to if-else case-switching inside loop bodies.
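Such if-else case-switching can be sketched as follows. This is a hypothetical two-mode kernel with illustrative names (not thesis code); the single shared loop body is what allows the HLS compiler to reuse one datapath across modes instead of instantiating a separate kernel per resolution level.

```cpp
#include <cassert>
#include <vector>

enum Mode { HALF_SAMPLE = 0, PASS_THROUGH = 1 };

// Multi-modal kernel: one unconditional loop nest, mode-dependent body.
std::vector<int> multi_modal(const std::vector<int>& in, int w, Mode mode) {
    std::vector<int> out;
    const int h = (int)in.size() / w;
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            int px = in[y * w + x];
            if (mode == HALF_SAMPLE) {
                if (y % 2 == 0 && x % 2 == 0) out.push_back(px); // 2x downsample
            } else {
                out.push_back(px);                               // forward unchanged
            }
        }
    }
    return out;
}
```

In hardware, `mode` would arrive over an AXI-Lite control register; both paths share the same read/write infrastructure, so only the mode-specific arithmetic is duplicated.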
C. Application to KinectFusion
In the dataflow graph, modes are defined to correspond to
different resolution levels; this produces the fastest allocation
of paths inside which to pipeline all components. Assuming all
other components of the KinectFusion system (reading sensor
frames, tracking, volumetric integration etc.) work sufficiently
fast, the resulting measurements on an Avnet Zedboard with a
PL clock period of 10 ns are as follows:
Configuration          Initiation interval   Max. frame rate   Avg. resource usage
Coexistence            2.53 ms               395 FPS           52 %
Block-level dataflow   2.10 ms               476 FPS           45 %
HLS-level dataflow     4.13 ms               242 FPS           35 %
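As a consistency check on the measurements above, the maximum frame rate is simply the reciprocal of the per-frame initiation interval (helper name is illustrative):

```cpp
#include <cassert>
#include <cmath>

// Maximum frame rate in FPS from a per-frame initiation interval in ms.
int fps_from_ii_ms(double ii_ms) {
    return (int)std::lround(1000.0 / ii_ms);
}
```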
The first configuration involving independent accelerators is
Pareto-dominated by the block-level dataflow architecture. Its
HLS-level counterpart, however, is twice as slow because
the whole IP core uses only one AXI
DMA to forward its 256-bit output stream to the PS. The
Zynq-7020 High-Performance port has a maximum data width
of 64 bits, forcing the DMA to chop up every element into
smaller packets and thus take four clock cycles to transfer one
aggregated data point. An advantage, however, is the decreased
total hardware utilization: the opportunity for
resource sharing across multiple modes of a hybrid block can
already be exploited earlier in the design process by the HLS
compiler, in contrast to block-level multi-modality.
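The four-cycle penalty follows directly from the width mismatch between the stream and the port; a minimal helper (illustrative) makes the ceiling division explicit:

```cpp
#include <cassert>

// Beats needed to transfer one wide stream element over a narrower port:
// ceil(stream_bits / port_bits), computed with integer arithmetic.
int beats_per_element(int stream_bits, int port_bits) {
    return (stream_bits + port_bits - 1) / port_bits;
}
```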
IV. CONCLUSIONS
High gains in performance were obtained by applying the
devised image processing acceleration methodology, although
careful attention to its usage is essential. Vivado HLS provides
a balanced mix of high-level and low-level details by allowing
fine-grained optimization of hardware computations, while
still abstracting away most of the repetitive specifics of
established paradigms such as pipelining and I/O interfacing.
Designing heterogeneous FPGA systems remains intricate,
however, mainly due to the inherent duality of having to
manage both hardware and software starting from a blank
slate. On the other hand, increasing the degree of automation
might adversely affect the quality of the resulting design.
Experiments on system-level acceleration of multiple
components bearing non-trivial dataflows reveal that there is
no clear-cut winner between composition at the block design
level versus virtually implementing the same concepts at an
earlier phase in HLS. Lastly, our findings on the practice of
multi-modal kernels closely match those of [2].
V. FUTURE WORK
Not all KinectFusion kernels could be adequately tested on
the FPGA due to scope constraints, which presents a concrete
possible direction for future work. Second, the implementation
on higher-end SoCs and/or a cascade of FPGAs should be
researched as well, since the combined resource utilization
makes fully off-loading KinectFusion onto the Zynq-7020
FPGA impossible. Finally, the block-level and HLS-level
dataflow variants could be treated as two ends of a spectrum;
an untested hypothesis is that a mixture of both methods might
lead to an optimum in terms of timing and area metrics.
REFERENCES
[1] C. Cadena et al., “Past, present, and future of
simultaneous localization and mapping: Toward the
robust-perception age,” IEEE Trans. Robot., vol. 32,
no. 6, pp. 1309–1332, 2016.
[2] K. Boikos and C.-S. Bouganis, “A Scalable FPGA-
based Architecture for Depth Estimation in SLAM,”
Appl. Reconfigurable Comput., 2019.
[3] M. Abouzahir, A. Elouardi, R. Latif, S. Bouaziz, and
A. Tajer, “Embedding SLAM algorithms: Has it come
of age?,” Rob. Auton. Syst., 2018.
[4] K. Rafferty et al., “FPGA-Based Processor
Acceleration for Image Processing Applications,” J.
Imaging, vol. 5, no. 1, p. 16, 2019.
[5] R. Nane et al., “A Survey and Evaluation of FPGA
High-Level Synthesis Tools,” IEEE Trans. Comput.
Des. Integr. Circuits Syst., vol. 35, no. 10, pp. 1591–
1604, 2016.
[6] R. A. Newcombe et al., “KinectFusion: Real-Time
Dense Surface Mapping and Tracking,” 2011.
[7] Q. Gautier, A. Shearer, J. Matai, D. Richmond, P.
Meng, and R. Kastner, “Real-time 3D reconstruction
for FPGAs: A case study for evaluating the
performance, area, and programmability trade-offs of
the Altera OpenCL SDK,” in Proceedings of the 2014
International Conference on Field-Programmable
Technology, FPT 2014, 2015, pp. 326–329.
[8] L. Nardi et al., “Introducing SLAMBench, a
performance and accuracy benchmarking
methodology for SLAM,” in Proceedings - IEEE
International Conference on Robotics and
Automation, 2015, vol. 2015-June, no. June, pp.
5783–5790.
[9] J. Lee, T. Ueno, M. Sato, and K. Sano, “High-productivity
Programming and Optimization Framework for Stream
Processing on FPGA,” in Proc. 9th Int. Symp. on
Highly-Efficient Accelerators and Reconfigurable
Technologies (HEART), pp. 1–6, 2018.
[10] O. Reiche, M. A. Ozkan, R. Membarth, J. Teich, and
F. Hannig, “Generating FPGA-based image processing
accelerators with Hipacc (Invited paper),” in IEEE/ACM
Int. Conf. on Computer-Aided Design (ICCAD),
pp. 1026–1033, 2017.
[11] K. Boikos and C. S. Bouganis, “Semi-dense SLAM on
an FPGA SoC,” in FPL 2016 - 26th International
Conference on Field-Programmable Logic and
Applications, 2016.
Contents
1 Introduction 1
1.1 Goals and outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Background and related research 5
2.1 Simultaneous Localization and Mapping . . . . . . . . . . . . . . . . . 5
2.1.1 KinectFusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 Benchmarking visual SLAM . . . . . . . . . . . . . . . . . . . . . 8
2.2 Field-Programmable Gate Arrays . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 The FPGA put into context . . . . . . . . . . . . . . . . . . . . . 12
2.2.2 System-on-Chip . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.3 High-Level Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.4 Designer workflow . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 SLAM on FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.1 Dense and semi-dense SLAM . . . . . . . . . . . . . . . . . . . . 20
3 High-level synthesis design of individual kernels 23
3.1 Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.1 Detailed algorithm description . . . . . . . . . . . . . . . . . . . 23
3.1.2 Source code, dataset and parameters . . . . . . . . . . . . . . . . 27
3.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.1 Common parallel patterns and categorization . . . . . . . . . . 31
3.2.2 Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.3 Efficient line buffering . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.4 Random memory access . . . . . . . . . . . . . . . . . . . . . . . 46
3.2.5 Data type selection . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4 Implementation of KinectFusion in HLS 57
4.1 Detailed results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.1.1 mm2m_sample (Map) . . . . . . . . . . . . . . . . . . . . . . . . 59
4.1.2 bilateral_filter (Stencil) . . . . . . . . . . . . . . . . . . . . . . . . 62
4.1.3 half_sample (Stencil) . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.1.4 depth2vertex (Map) . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.1.5 vertex2normal (Stencil) . . . . . . . . . . . . . . . . . . . . . . . 67
4.1.6 track (Gather & Map) . . . . . . . . . . . . . . . . . . . . . . . . 69
4.1.7 reduce (Reduce) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.1.8 integrate (Gather & Map) . . . . . . . . . . . . . . . . . . . . . . 72
4.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.2.1 Evaluation of the methodology . . . . . . . . . . . . . . . . . . . 74
5 System-level acceleration of multiple kernels 77
5.1 Dataflow of KinectFusion . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.1.1 Generalized problem statement . . . . . . . . . . . . . . . . . . . 80
5.2 System architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.2.1 Hardware debugging . . . . . . . . . . . . . . . . . . . . . . . . 82
5.2.2 Bandwidth limitations . . . . . . . . . . . . . . . . . . . . . . . . 82
5.3 Independent coexistence of kernels . . . . . . . . . . . . . . . . . . . . . 83
5.3.1 Performance analysis . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.4 Task-level pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.4.1 Intermediate output aggregation . . . . . . . . . . . . . . . . . . 92
5.4.2 Multi-modal execution . . . . . . . . . . . . . . . . . . . . . . . . 94
5.4.3 Application to KinectFusion . . . . . . . . . . . . . . . . . . . . . 94
5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.5.1 Comparison of timing and resource profiles . . . . . . . . . . . 102
6 Conclusions and future work 103
6.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Bibliography 107
List of Figures
2.1 Continuum of SLAM algorithms from sparse (e.g. using feature ex-
traction) to dense (e.g. using voxelated maps) [3]. . . . . . . . . . . . . 6
2.2 Part of KinectFusion’s map (right) and a slice through the volume
(left) showing truncated signed distance values, each representing a
distance F to a surface [5]. Grey voxels are those without a valid mea-
surement, and are naturally found within solid objects. . . . . . . . . . 7
2.3 System workflow of the KinectFusion method [5]. . . . . . . . . . . . . 8
2.4 Simplified overview of KinectFusion kernels. A subscript j indicates
the presence of several resolution levels, while i indicates the presence
of multiple iterations within a level. . . . . . . . . . . . . . . . . . . . . 9
2.5 Violin plots comparing four SLAM algorithms on the NVIDIA Jetson
TK1, a GPU development board [6]. Here, KF-CUDA stands for a
CUDA-implementation of KinectFusion. . . . . . . . . . . . . . . . . . . 11
2.6 (a) Sketch of the FPGA architecture; (b) Diagram of a simple logic
element. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.7 Diagram comparing the FPGA to other processing platforms [19]. . . . 14
2.8 Functional block diagram of the Zynq-7000 SoC [22]. . . . . . . . . . . 16
2.9 Annotated photograph of the Avnet Zedboard (adapted from [28]). . . 16
3.1 Illustration of the bilateral filter, showing its edge-preserving prop-
erty [46]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Overview of KinectFusion kernels. Green shaded areas include blocks
that are executed multiple times per frame and per level; once for ev-
ery iteration i. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Screenshot of the SLAMBench2 GUI when evaluating the ’Living Room
2’ scene. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Mean ATE for different configurations of KinectFusion. The cubed
numbers indicate volume resolutions, while the input FPS corresponds
to both the tracking and integration rate. . . . . . . . . . . . . . . . . . . 30
3.5 A) RGB video stream (unused). B) Latest depth map captured by the
Kinect sensor. C) Reconstructed scene using KinectFusion [37]. . . . . . 30
3.6 The Map pattern [9]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.7 The Stencil pattern [9]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.8 The Reduce pattern [9]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.9 The Gather (or Scatter) pattern [9]. . . . . . . . . . . . . . . . . . . . . . 35
3.10 The Search pattern [9]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.11 Non-exhaustive code snippet representing a possible instance of the
Search parallel pattern. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.12 Concept of pipelining applied to a repeated calculation called ’op’ on
a large array. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.13 Effect of pipelining on the timing profile and resource utilization. . . . 40
3.14 Analysis of a pipelined Map kernel, showing the parallelized elemen-
tary operations constituting a matrix-vector multiplication. Note that
the analysis view in Vivado HLS does not clearly indicate overlapped
computation, even though it is definitely present here: a read from
and write to the streaming interface occurs at every single clock cycle
(or equivalently, control step). . . . . . . . . . . . . . . . . . . . . . . . . 41
3.15 Illustration of the Stencil parallel pattern and a corresponding buffer-
ing technique for its implementation on the FPGA. . . . . . . . . . . . . 42
3.16 Report and analysis of a naive implementation of bilateral_filter; nei-
ther line buffering nor array partitioning is applied. . . . . . . . . . . . 44
3.17 Report and analysis of an improved implementation of bilateral_filter which includes line buffer and memory window functionality. . . . . 45
3.18 Array partitioning strategy for optimizing Stencil computations. Dif-
ferently colored elements need to be accessed independently and in
parallel, which is possible only by distributing them across different
instances of internal storage components. (The memory window is
fully partitioned in all dimensions.) . . . . . . . . . . . . . . . . . . . . . 46
3.19 HLS report of the fully optimized bilateral_filter kernel. . . . . . . . . . 46
3.20 Resulting BRAM instances in the HLS report for different memory
sizes in Listing 3.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.21 Kinect v2 accuracy error distribution [66]. . . . . . . . . . . . . . . . . . 54
3.22 Kinect v1 offset and precision [44]. . . . . . . . . . . . . . . . . . . . . . 54
4.1 Effect of every optimization on the timing, resource and accuracy pro-
file of mm2m_sample (Map). . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2 I/O diagram of the mm2m_sample HLS kernel before and after du-
plicating its processing elements 8-fold, assuming no bandwidth bot-
tlenecks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3 Effect of every optimization on the timing, resource and accuracy pro-
files of bilateral_filter (Stencil). . . . . . . . . . . . . . . . . . . . . . . . . 62
4.4 Exponential function approximation for the bilateral filter, with the
actual frequency (popularity) of all arguments translated to the thick-
ness of the green layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.5 Pareto diagram of the bilateral filter’s HLS average resource usage
(not including BRAM) and measured accuracy when all eight possible
configurations of three separate optimizations are tested. One outlier
with a large error is not shown. . . . . . . . . . . . . . . . . . . . . . . . 65
4.6 Effect of every optimization on the timing, resource and accuracy pro-
files of half_sample (Stencil). . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.7 HLS performance analysis view of an unnecessarily complex division
that went unnoticed by the HLS compiler. . . . . . . . . . . . . . . . . . 67
4.8 I/O diagram of the half_sample HLS kernel before and after duplicat-
ing its processing elements 4-fold, assuming no bandwidth bottlenecks. 68
4.9 Effect of every optimization on the timing, resource and accuracy pro-
file of depth2vertex (Map). . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.10 Effect of every optimization on the timing, resource and accuracy pro-
file of vertex2normal (Stencil). Contrary to most other cases, the con-
version from floating point to fixed-point has a negative effect here. . . 69
4.11 Effect of every optimization on the timing, resource and accuracy pro-
file of track (Gather & Map). . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.12 Heatmap of the accessed pixel positions within the reference maps
relative to the corresponding regular loop over the input maps for
the first level of track. Yellow means high frequency, purple means
the opposite. The underlying data was extracted from five frames
selected over a video fragment captured at 30 FPS, and shows that
horizontal movement of up to 750 pixels per second occurred at some
point. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.13 Effect of every optimization on the timing, resource and accuracy pro-
file of reduce (Reduce). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.14 Effect of every optimization on the timing, resource and accuracy pro-
file of integrate (Gather & Map). . . . . . . . . . . . . . . . . . . . . . . . 73
4.15 Two-dimensional illustration of a frustum-encompassing block, to which
loop boundaries can safely be restricted. The green coloured blocks
represent volumetric elements that are visible from the sensor’s cur-
rent position, meaning that all yellow elements remain unchanged
during integration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.1 Dataflow diagram of the first five kernels of KinectFusion. . . . . . . . 79
5.2 Illustration of two generalized dataflow challenges. . . . . . . . . . . . 81
5.3 Overview of the System-on-Chip architecture for the execution of a
custom IP core. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4 System architecture when five coexisting kernels are implemented to-
gether on the FPGA. By allocating one port for every accelerator, hard
constraints on concurrent executions are avoided. . . . . . . . . . . . . 84
5.5 Waveforms produced by the System ILA for the vertex2normal kernel. . 87
5.6 Diagrams depicting how the five kernels should be executed in time
if the DDR access speed were unlimited. The rows correspond to ac-
celerators each managing their own DMA and PS-PL port, while the
distinct tasks are labelled with resolution levels (0 stands for 320x240,
1 for 160x120 and 2 for 80x60). . . . . . . . . . . . . . . . . . . . . . . . . 89
5.7 System ILA waveforms for bilateral_filter when it is executed alone,
revealing a strange hiccup. The vertical lines are spaced 200 ns. . . . . 90
5.8 System ILA waveforms for half_sample in the multi-frame execution.
Large-scale pauses and restarts are clearly visible, and occur presum-
ably due to the DDR controller having to operate at full capacity. The
vertical lines are spaced 1 µs. . . . . . . . . . . . . . . . . . . . . . . . . 91
5.9 Two possible solutions for intermediate output aggregation (Figure
5.2b). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.10 Two possible solutions for multi-modal execution (Figure 5.2c). . . . . 95
5.11 Three different sets of paths (depicted as large arrows) that connect
components to combine using task-level pipelining. The time for one
path is estimated from the slowest block inside that path, and the
paths should be executed separately in time to enable resource shar-
ing across different modes. . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.12 System architecture that handles the multi-level dataflow challenge of
KinectFusion’s first five kernels (see Figure 5.1) completely within the
Vivado block design, leaving the HLS IP cores unchanged. AXI-Lite
control signals are omitted for clarity, and the bottleneck-inducing
streams are marked with a red data width label. . . . . . . . . . . . . . 97
5.13 Schedule to process incoming sensor frames using the improved ac-
celerators. Due to the application of task-level pipelining, all subcom-
ponents now adapt to the slowest link in the chain, which is formed
by bandwidth limitations. . . . . . . . . . . . . . . . . . . . . . . . . . 98
List of Tables
2.1 Summary of 3D SLAM algorithms adapted and compared by [6]. . . . 10
2.2 A compilation of recent 3D SLAM applications involving the FPGA
taking up roles of varying importance, showing a trend of decreasing
frame rate with increasing "density". SoC (System-on-Chip) boards
always contain both an embedded CPU and an FPGA. . . . . . . . . . 20
3.1 Time spent in each kernel when KinectFusion is executed on the CPU
of either a regular laptop or the Avnet Zedboard. The resulting frame
rate is determined by summing up all timings on a given platform. . . 31
3.2 Timing and resource usage for various implementations of a simple
series of arithmetic calculations. . . . . . . . . . . . . . . . . . . . . . . . 52
3.3 Timing and resource usage for various implementations of a square
root calculation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.1 Category, I/O dimensions, estimated timing and average accuracy of
every KinectFusion kernel if it were executed on the FPGA.
Bandwidth limitations and other external factors are not yet taken into
account, since these fall outside the scope of Vivado HLS. . . . . . . . . 58
4.2 Resource utilization estimated by HLS for every KinectFusion ker-
nel’s top function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 Impact of the optimizations arising from adoption of the methodology
versus use case-specific knowledge on the estimated performance of
KinectFusion’s kernels in HLS. . . . . . . . . . . . . . . . . . . . . . . . 74
5.1 I/O characteristics of all instances of KinectFusion’s first five kernels. . 79
5.2 Time spent in each kernel as measured on both the PS and PL of the
Zedboard. Summing these values assumes that all kernels are exe-
cuted separately in time, and can be placed side by side onto the same
FPGA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.3 Realized maximum I/O throughputs that conform to HP port band-
width bounds. The data widths and elements processed per clock
cycle are measured in terms of data units meaningful to KinectFusion
(e.g. one depth value), without regard for details involving packed
structs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.4 Comparison of timing and resource profiles after implementing mm2m_sample
through vertex2normal as separate accelerators versus applying both
discussed multi-level dataflow techniques. . . . . . . . . . . . . . . . . 102
6.1 Time spent in each kernel when KinectFusion is executed on either the
ARM Cortex-A9 CPU or Xilinx Zynq-7020 FPGA of the embedded SoC. . . 104
List of Listings
3.1 Code snippet representing the Map parallel pattern. . . . . . . . . . . . 32
3.2 Code snippet representing the Stencil parallel pattern. . . . . . . . . . . 33
3.3 Code snippet representing the Reduce parallel pattern. . . . . . . . . . 34
3.4 Code snippet representing the Gather parallel pattern. . . . . . . . . . . 35
3.5 Vivado HLS code to test the maximum size of a 16-bit integer array.
Data is copied in burst mode from external memory, similar to how
block-by-block processing is implemented in practice. Although the
compiler places the local array into block RAM by default, the HLS
RESOURCE directive [1] is still included for clarity. . . . . . . . . . . . 49
3.6 Vivado HLS code for a fixed-point simple pipelined arithmetic calcu-
lation, belonging to the Map pattern. . . . . . . . . . . . . . . . . . . . . 51
3.7 Vivado HLS code for a fixed-point square root calculation, belonging
to the Map pattern. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.1 Code snippet summarizing how the multi-level dataflow problem is
to be solved within Vivado HLS. . . . . . . . . . . . . . . . . . . . . . . 99
Abbreviations
ACP Accelerator Coherency Port
AXI Advanced eXtensible Interface
CPU Central Processing Unit
DDR Double Data Rate
DMA Direct Memory Access
DRAM Dynamic Random Access Memory
DSE Design Space Exploration
FIFO First-In, First-Out
FPGA Field-Programmable Gate Array
FSM Finite State Machine
GPU Graphics Processing Unit
HLS High-Level Synthesis
HP High-Performance
ILA Integrated Logic Analyzer
IP Intellectual Property
MM Memory-Mapped
PCI Peripheral Component Interconnect
PL Programmable Logic
PS Processing System
RAM Random Access Memory
SLAM Simultaneous Localization And Mapping
Chapter 1
Introduction
As we embark on the road towards a more autonomous world, countless challenges
and opportunities emerge in various subdisciplines of computer architecture, algo-
rithm design and electronics. One such challenge is Simultaneous Localization and
Mapping or SLAM, a relatively modern application that attempts to make a robot
aware of its surroundings. SLAM concerns the dual problem of constructing a model
of the robot’s real-world environment, while also determining the position and ori-
entation of the robot moving inside this map at the same time [2]. Many distinct
implementations of this concept exist. Dense SLAM variants, for example, distin-
guish themselves from their sparse counterparts by incorporating as much data as
possible captured by the sensors into their global reconstruction. This gives them
a considerable edge, mainly because they create a high-quality model
that is reusable across other applications as well. However, this comes at the cost
of greater computational demands [3]. On the other hand, use cases such as au-
tonomous driving, augmented reality, indoor mapping or navigation, and basically
any requirement of high-quality environmental awareness on mobile or low-power
devices, all justify why one might desire to run dense SLAM on embedded devices
as well rather than on high-end GPUs only.
The need for embedded SLAM solutions is evident. The Field-Programmable
Gate Array (FPGA), a low-power integrated circuit that is reconfigurable yet can
reach high performance and efficiency, offers a flexible hardware platform on which
to evaluate the implementation of a dense 3D SLAM algorithm. It is essentially
a large grid of elementary blocks and routing interconnects, both of which can be
reprogrammed by the designer ’in the field’. While FPGA designs are tradition-
ally developed using a hardware description language such as VHDL, we employ
the emerging High-Level Synthesis (HLS) tools as a means to evaluate the present-
day programmability of FPGAs as well as the quality of our design methodology.
The strength of Vivado HLS is its capability to perform high-level, pragma-directed
compilation of C-code into hardware modules [4]. Also essential to this dissertation
is the concept of a System-on-Chip (SoC), which integrates a CPU and an FPGA into
one package. The Zedboard development board is then used to evaluate both
hardware and software running on the Zynq-7020 SoC.
Computer vision and signal processing are research fields that are well repre-
sented in typical FPGA applications as well as the low-level operation of SLAM
[2]. This dissertation is based on the implementation of the KinectFusion algorithm,
mainly because it is very representative of many kernels within the general context
of image processing. Both two-dimensional and three-dimensional data structures
are processed in various ways throughout the KinectFusion pipeline [5], giving rise
to a diverse exploration of possible FPGA-specific optimizations. Furthermore, it
allows for the extraction of guidelines involving the methodology for FPGA pro-
gramming. Other benefits over comparable SLAM algorithms include its relatively
low memory requirement and good accuracy [6].
1.1 Goals and outline
The aim of this master’s thesis is to provide a framework by which SLAM and, by
extension, image processing kernels in general, can be efficiently mapped onto FPGAs.
Beyond the exploitation of parallelism and pipelining, the full translation of software
algorithms into HLS code is often non-trivial. We also intend to explore how
multiple kernels with complex dataflow characteristics can be combined in hardware
so as to
form an efficient, large-scale pipeline consisting of functional blocks. The final goal
is to achieve a heterogeneous implementation of 3D SLAM that is as fast as
practicable, while investigating which concepts and techniques can be distilled
in order to create a more generally applicable methodology as a side effect.
The outline and contributions of this thesis are as follows:
• Chapter 2 reviews the background and existing literature about SLAM, FPGAs,
SLAM on FPGAs and justifies several choices made in this thesis.
• Chapter 3 delineates the methodology that was developed to deal with the ef-
fective optimization of kernels bearing different computational and data man-
agement patterns using HLS.
• Chapter 4 applies these practices to KinectFusion and evaluates the extent to
which they brought us to a satisfying solution versus how many additional
optimizations had to be applied.
• Chapter 5 explores various ways in which multi-level dataflow can be realized
efficiently, fitting together the first five kernels of KinectFusion onto the pro-
grammable logic. A comparison is made among three architectures using their
resulting timing and resource metrics.
• Finally, Chapter 6 formulates a conclusion presenting some takeaways of our
research and opportunities for future work.
Chapter 2
Background and related research
2.1 Simultaneous Localization and Mapping
Simultaneous Localization and Mapping (SLAM) is an advanced computer vision
and robotic navigation algorithm that has made significant progress over the last
30 years. Its purpose is to track the state of an agent within an unknown environ-
ment, while simultaneously constructing a model of this very environment using its
sensory observations [2]. The state is typically described by its pose (position and ori-
entation), while the model essentially refers to a map, which is either a representation
of some interesting aspects (so-called features) or a dense volumetric description of
the robot’s surroundings. It is clear that both components of SLAM, being
localization and mapping, cannot be solved independently of each other. A sufficiently
detailed map is needed for localization, while an accurate pose estimate is required
to be able to reconstruct or update the map [7]. Localization is often done by means
of tracking, which compares the incoming sensor data with the map that has been
generated so far in order to create a new estimate of the current pose [3].
One of SLAM’s emerging use cases is the variety of applications in mobile robo-
tics, including but not limited to path planning, visualization, augmented reality
and 3D object recognition. In general, many situations where localization infrastruc-
ture is absent (such as indoor operation) give rise to the present-day popularity of
SLAM. The same holds true for any scenario where detailed up-to-date maps need
to be created but are not available beforehand. Cadena et al. [2] note that SLAM is a
vital aspect of robotics, and is being increasingly deployed in various real-world set-
tings that range from autonomous driving and household robots to mobile devices.
However, it is also stated that more research is needed to achieve true robustness in
navigation and perception, especially for autonomous robots that ought to operate
independently for a long time. In this sense, SLAM has not been fully solved yet,
but we note that the algorithms considered in this thesis definitely hold potential
for investigation and acceleration due to the broad applicability of their underlying
concepts.
FIGURE 2.1: Continuum of SLAM algorithms from sparse (e.g. using
feature extraction) to dense (e.g. using voxelated maps) [3].
Implementations of SLAM come in many shapes and sizes; this diversity is par-
tially illustrated in Figure 2.1. On one end of the spectrum, we have sparse SLAM
that focuses on the selection of a limited number of features or landmarks. This has
the upside of being computationally lighter but carries the significant downside of
reducing the quality and usability of the reconstruction. On the other end, a dense
algorithm reverses these properties: its ability to generate a much higher quality
map of the environment comes at the cost of being computationally intensive. Semi-
dense visual SLAM implementations have emerged in an attempt to form a compro-
mise, although the resulting model is still incomplete with respect to the fully dense
variant as the algorithms do not deal with all of the available sensory information
[2], [8].
2.1.1 KinectFusion
KinectFusion is a real-time dense scene reconstruction algorithm, published by Mi-
crosoft in 2011. As a SLAM algorithm, it continuously updates a global 3D map and
tracks the position of a moving depth camera within this environment. Several
innovations were built into this system by [5]. First, it works under all lighting
conditions, including complete darkness, since only the depth data is used. Viable
consumer-oriented depth sensors include the Microsoft Kinect camera. Furthermore,
the localization step is always done with respect to the most up-to-date global map.
The map, which is represented as a volume of truncated signed distance function
(TSDF) values, thereby
recapitulates the information of all previous depth frames seen so far. This helps to
avoid the drifting problems commonly associated with simple frame-to-frame align-
ment.
Figure 2.2 depicts a typical volume consisting of TSDF values. Here, F is de-
fined as the signed distance to the nearest surface. Its value is positive if it is outside
of a (solid) object, and its magnitude is truncated to a fixed maximum in order to
avoid the interference of surfaces far away from each other [5]. The global surface is
then defined as the set of points where F = 0, hence this data structure belongs to
the class of implicit surface representations [2]. A functional limitation of
KinectFusion is that, unlike many other SLAM algorithms [6], the global map cannot
expand at runtime because its size is predefined. This renders KinectFusion unsuited
for large-scale SLAM (on the order of 500 cubic meters or more), although the author
notes that many of the aforementioned applications do not necessarily require this
functionality.

FIGURE 2.2: Part of KinectFusion’s map (right) and a slice through the
volume (left) showing truncated signed distance values, each representing
a distance F to a surface [5]. Grey voxels are those without a valid
measurement, and are naturally found within solid objects.
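The truncation of F described above can be made concrete with a short sketch: a raw signed distance is clamped to the band [−µ, +µ] before being stored in the volume. The symbol µ for the truncation distance follows the paper's notation; the function itself is an illustration, not taken from any implementation.

```cpp
#include <algorithm>

// Illustrative sketch: clamp a raw signed distance to the truncation
// band [-mu, +mu] before it is stored as a TSDF value. 'mu' denotes the
// truncation distance from the KinectFusion paper; the function name and
// signature are hypothetical.
float truncate_sdf(float signed_distance, float mu) {
    return std::max(-mu, std::min(signed_distance, mu));
}
```

Values well inside an object or far in front of a surface thus saturate at ±µ, which is exactly what prevents distant surfaces from interfering with each other.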
Technical description
The overall workflow of KinectFusion is shown in Figure 2.3. From a high-level
perspective, four interconnected stages can be distinguished as follows:
1. Surface measurement. After obtaining the raw depth map captured by a Mi-
crosoft Kinect (or equivalent) camera, this preprocessing step calculates 3D
vertex and normal vector arrays at multiple resolution levels.
2. Pose estimation. The device is tracked using a variant of the Iterative Clos-
est Point (ICP) algorithm; see the original paper for its description. The live
measurement is aligned in a coarse-to-fine manner with a predicted surface
measurement, which is in turn obtained from the surface prediction phase.
3. Update reconstruction. Given an accurate pose estimate, the incoming depth
data is integrated into the volume. TSDF values within the frustum are up-
dated to accommodate the new sensor measurement, further consolidating
the global model.
4. Surface prediction. A raycast is performed on the most up-to-date volume,
thereby producing a dense and reliable surface measurement estimate against
which to perform alignment in the pose estimation phase. Loop closure
between mapping and localization is achieved this way [5].

FIGURE 2.3: System workflow of the KinectFusion method [5].
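The interplay of these four stages can be sketched as one per-frame loop. The code below is a control-flow illustration only: the stage functions are hypothetical stubs that merely record their invocation order, and none of the names reflect the real KinectFusion API.

```cpp
#include <string>
#include <vector>

// Records which stage ran, in order (for illustration only).
std::vector<std::string> stage_log;

struct Surface {};   // vertex + normal maps (placeholder)
struct Pose {};      // camera position and orientation (placeholder)
struct Volume {};    // TSDF reconstruction volume (placeholder)

// Hypothetical stage stubs; real implementations are far more involved.
Surface surface_measurement(const std::vector<float>&) { stage_log.push_back("measure"); return {}; }
Pose    estimate_pose(const Surface&, const Surface&)  { stage_log.push_back("track");   return {}; }
void    integrate(Volume&, const std::vector<float>&, const Pose&) { stage_log.push_back("integrate"); }
Surface raycast(const Volume&, const Pose&)            { stage_log.push_back("raycast"); return {}; }

// One iteration of the main loop: the predicted surface from the previous
// frame is what the new measurement is aligned against (loop closure).
void process_frame(const std::vector<float>& depth, Pose& pose,
                   Volume& volume, Surface& predicted) {
    Surface live = surface_measurement(depth);   // 1. surface measurement
    pose = estimate_pose(live, predicted);       // 2. pose estimation (ICP)
    integrate(volume, depth, pose);              // 3. update reconstruction
    predicted = raycast(volume, pose);           // 4. surface prediction
}
```

Note how the raycast output of frame n becomes the alignment target of frame n+1, which is the loop closure between mapping and localization mentioned above.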
Figure 2.4 shows correspondences between these high-level stages and the sub-
routines in the source code provided by [9]. Note that the system-level dataflow of
KinectFusion is much more complex in reality, and contains many interacting ker-
nels with multiple instances. These communication and replication aspects are left
out for simplicity here, but will be explored in detail in Chapter 3.
2.1.2 Benchmarking visual SLAM
Nardi et al. [6], [9] have introduced SLAMBench, a tool used to test the correctness
and performance of various 3D SLAM algorithms. Given a dataset with a ground
truth camera trajectory, the accuracy, speed and optionally the power consumption
of a specified SLAM implementation can be measured on various CPU or GPU
platforms. This benchmark provides an important basis by which to evaluate the
effect of several parameter choices, such as the resolution of the reconstruction volume
and the frame rate. Essentially, it will allow the author to deviate from the refer-
ence KinectFusion implementation whenever it is deemed useful to do so, while still
keeping track of the possible degradation in quality due to these optimizations.
Accuracy evaluation
SLAMBench allows for detailed accuracy measurements of different SLAM imple-
mentations in the form of an absolute trajectory error (ATE). At every frame during
the execution, there exists a certain error between the estimated camera position as
produced by the application under test (AUT) and the ground truth position. The
ATE, as described by [10], is a metric that serves to evaluate this discrepancy using
a scaled Euclidean distance calculation, after aligning both trajectories in a least-
squares manner. The mean ATE is then simply its average over all frames, and will
be used hereafter to quantify the accuracy of any set of parameters used as input to
KinectFusion.
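Once both trajectories have been aligned, the mean ATE reduces to an average of per-frame Euclidean distances. The sketch below illustrates only that final step; the least-squares alignment described by [10] is deliberately omitted, and the types and names are hypothetical.

```cpp
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };  // a 3D camera position (placeholder type)

// Mean absolute trajectory error over two already-aligned trajectories of
// equal length. Illustrative sketch: the least-squares alignment step of
// the full ATE metric is assumed to have been applied beforehand.
float mean_ate(const std::vector<Vec3>& estimated,
               const std::vector<Vec3>& ground_truth) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < estimated.size(); ++i) {
        float dx = estimated[i].x - ground_truth[i].x;
        float dy = estimated[i].y - ground_truth[i].y;
        float dz = estimated[i].z - ground_truth[i].z;
        sum += std::sqrt(dx * dx + dy * dy + dz * dz);
    }
    return sum / estimated.size();
}
```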
[Figure content: depth input → mm2m_sample → bilateral_filter → half_sample_j →
depth2vertex_j → vertex2normal_j (1. surface measurement) → track_i,j →
reduce_i,j → update_pose_i,j (2. pose estimation) → integrate (3. update
reconstruction) → raycast (4. surface prediction) → map and pose outputs.]
FIGURE 2.4: Simplified overview of KinectFusion kernels. A subscript j
indicates the presence of several resolution levels, while i indicates the
presence of multiple iterations within a level.
Algorithm          Type       Required sensors             Year
ORB-SLAM2 [11]     Sparse     Monocular, stereo or RGB-D   2016
LSD-SLAM [12]      Semi-dense Monocular                    2014
ElasticFusion [13] Dense      RGB-D                        2015
InfiniTAM [14]     Dense      RGB-D                        2015
KinectFusion [5]   Dense      RGB-D                        2011

TABLE 2.1: Summary of 3D SLAM algorithms adapted and compared by [6].
Comparison of KinectFusion among other SLAM algorithms
The following items summarize the performance results of SLAM in literature [6]
as well as benchmarks executed by the author, in order to give context to the per-
formance of KinectFusion. The considered algorithms are listed in Table 2.1, and a
standard (mid-to-high-end) set of parameters is used for their evaluation.
Accuracy. Figure 2.5 indicates that the trajectory accuracy of KinectFusion is
generally mediocre, although the author’s own executions have indicated that its
mean ATE is among the best as long as no loss of track occurs. A major drawback
of KinectFusion is that it tends to get lost completely during some video fragments,
causing the pose to simply stop updating midway through the benchmark. High drift occurs
as a consequence, which explains the high variability when performing accuracy
measurements on different datasets.
Memory requirements. In a comparison made by Bodin et al. [6], KinectFusion
turned out to require the least memory among the five recent sparse, semi-
dense and dense SLAM algorithms shown in Table 2.1. The memory usage depends
on the dimensions of the reconstruction volume, but is on the order of 50 MB for a
relatively detailed map of 256³ elements. However, it should be noted that this value
is still very high compared to typical FPGA applications, since local storage on the
FPGA is typically on the order of a few megabits. This indicates a priori that the
implementation of KinectFusion on the FPGA is likely to be a challenging task.
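The quoted order of magnitude is easy to verify with a back-of-the-envelope calculation. The bytes-per-voxel figure used below is an assumption, since the exact storage format of a TSDF value and its weight varies between implementations.

```cpp
#include <cstdint>

// Back-of-the-envelope memory footprint of a cubic reconstruction volume.
// 'bytes_per_voxel' is an assumed parameter, not a figure from the thesis.
std::uint64_t volume_bytes(std::uint64_t side, std::uint64_t bytes_per_voxel) {
    return side * side * side * bytes_per_voxel;
}
```

At an assumed 3 bytes per voxel, a 256³ volume occupies 256³ · 3 = 50,331,648 bytes ≈ 50 MB, consistent with the order of magnitude quoted above and far beyond the few megabits of on-chip block RAM.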
Speed. According to Figure 2.5, KinectFusion is faster than most of its coun-
terparts, achieving around 8 FPS on a GPU platform. This cannot simply be gen-
eralized towards heterogeneous CPU-FPGA executions, although it does provide
another hint that KinectFusion might be the most promising choice to attempt to
accelerate.
FIGURE 2.5: Violin plots comparing four SLAM algorithms on the
NVIDIA Jetson TK1, a GPU development board [6]. Here, KF-CUDA
stands for a CUDA implementation of KinectFusion.
2.2 Field-Programmable Gate Arrays
The Field-Programmable Gate Array (FPGA) is essentially a two-dimensional grid
of reconfigurable blocks and routing channels, offering a low-volume yet highly
efficient alternative to the Application-Specific Integrated Circuit (ASIC). The term
gate array refers to the fact that these elementary building blocks consist of various
logic gates, providing look-up tables (LUT), registers, full adders (FA), multiplexers,
flip-flops (FF) and more. Special blocks such as Digital Signal Processors (DSPs) are
also at the designer’s disposal: these serve to perform arithmetic operations such
as multiplications more efficiently than by merely using LUTs. A field-programmable
integrated circuit is one that can be reprogrammed on the spot so as to perform almost
any hardware functionality that the user desires. Whereas ASICs have their elec-
tronic circuitry permanently ’baked’ into silicon, the FPGA’s logic can be changed
at will long after it has been manufactured precisely because its logic blocks and
interconnects are reconfigurable. The designs running on an FPGA are typically
created using a Hardware Description Language (HDL). This language allows the
user to formally describe the behaviour of digital circuits by means of specifying,
among others, how signals should be connected together and which logical oper-
ations should be performed. In the synthesis phase, this description is then trans-
formed into a list of electronic building blocks and their interconnections. After-
wards, the blocks are mapped onto the physical rectangular layout of the FPGA in
the mapping phase. Finally, the routing phase decides how to connect these placed
components. The resulting implemented design specifies exactly how each available
FPGA resource should be configured, including how the interconnections should be
routed so as to connect the relevant blocks together.
As an example, Figure 2.6 (adapted from [15]) depicts the architecture of an
island-style FPGA. Here, another special block called the I/O block is shown, resid-
ing at the periphery of the device. These serve to provide external connections and
are necessary to communicate with the world outside of the Programmable Logic
(PL).

FIGURE 2.6: (a) Sketch of the FPGA architecture; (b) Diagram of a
simple logic element.
2.2.1 The FPGA put into context
Figure 2.7 shows a simplified comparison of how the FPGA can be situated among
the CPU, GPU and ASIC. On the left end, we find a Central Processing Unit (CPU).
This general-purpose device is clearly the most flexible with regard to programma-
bility, but it is also the least efficient one. In this context, efficiency refers to both speed
(throughput, latency) and power consumption. On the right end, we find an ASIC:
this device is the most rigid of all but also the most efficient. The logic is burned right
into silicon, which fixes its functionality permanently but allows for an extremely
low latency and energy consumption. Within a given semiconductor technology, a
much better efficiency can be achieved by ASICs relative to FPGAs since negligible
overhead exists. Since the components and interconnections are fully fixed before-
hand, their area utilization and speed metrics are much better than for comparable
designs on the FPGA. For example, [16] found that the average FPGA-to-ASIC ratio
of silicon area required to implement circuits containing only LUT-based logic and FFs is 35.
Moving to the left on the axis generally means sacrificing efficiency, while gain-
ing the ability to easily run a wider range of applications. On the other hand, moving
to the right means giving up on ad-hoc programmability and configurability, but in-
stead gaining increased potential for high-efficiency or high-performance computa-
tion in return. For example, the Graphics Processing Unit (GPU) cannot run general-
purpose programs, although it is quite well suited for massively parallel or vectorized
calculations thanks to its large number of processing units. They are, however,
very power-hungry, a drawback FPGAs and ASICs do not suffer from. The FPGA’s
low power consumption and hardware reconfigurability explain the growing interest
it attracts in the academic and industrial world.
Reality is of course not one-dimensional, and the FPGA has its fair share of
differences and advantages that offset it on other axes as well, figuratively speaking.
Rather than just existing inbetween GPUs and ASICs, it is also useful to compare
the FPGA with the CPU with respect to how they process data. CPUs often have
higher clock speeds, but execute every instruction in a much less parallel fashion
than their counterparts. While the CPU has undergone many architectural improve-
ments to accelerate the execution of software, including multi-core functionality, Sin-
gle Instruction Multiple Data (SIMD) technology, instruction-level parallelism (ILP),
speculative execution and more, these extra tools are only available under specific
circumstances. Serial execution happens otherwise, which results in every data el-
ement being processed one by one. A resulting benefit is that code with a high
degree of control statements, for example with many if-then-else constructions,
is handled well by the CPU [17]. On the other hand, FPGAs provide a more direct
shortcut to hardware, as they allow for effective pipelined and dataflow-oriented
architectures to be designed and implemented for a given (fixed) algorithm. FP-
GAs provide the opportunity to spatially parallelize complex computations across
its many reconfigurable blocks and routing channels, in order to achieve a process-
ing speed many times greater than the CPU [18]. However, note that the maximum
DDR access speed, bus widths and I/O limitations still define upper limits with re-
gard to communication for both devices.
Lastly, a disadvantage of FPGAs is that, despite the rise of tools such as High-
Level Synthesis (HLS) that attempt to ease its development [4], FPGAs intrinsically
remain quite difficult to program. The mindset for FPGA development is very dif-
ferent from that of software engineering [17], which is essentially due to the design
process being multifaceted and intricate, involving both hardware and software. To
design an FPGA system is to start from a blank slate: the architecture is not fixed,
but can be changed to perform virtually any digital hardware logic the user wishes it
to. For a System-on-Chip combining both a CPU and FPGA, the situation becomes
even more involved: in addition to devising effective hardware, the designer also
has to write good software around their custom architecture, and has to ensure that
all components work well together as intended. This stands in stark contrast to reg-
ular software development that does not deal with variable architectures, such as on
a desktop or laptop CPU. In short, a high degree of technical expertise is required in
this field, although HLS can definitely be regarded as a positive evolution towards
facilitating the hardware design aspect of this two-fold development process.
FIGURE 2.7: Diagram comparing the FPGA to other processing platforms [19].
Power consumption
Minimizing the power usage of an application is especially important in the context
of embedded devices, where SLAM is most likely to be found. In theory, the FPGA is
more energy-efficient than the CPU and GPU by design. After all, this device is clas-
sified as ’reprogrammable hardware’, meaning that its data operations are directly
encoded into hardware. The overhead for any given calculation is therefore greatly
reduced. Furthermore, high throughputs can be achieved thanks to the opportunity
for efficient pipelining, no matter how complex the string of operations may be.
To verify the above claims, [20] compared the energy consumption of a high-end
Altera Stratix III E260 FPGA with an NVIDIA GeForce GTX 295 GPU and quad-core
Xeon W3520 CPU. For many typical sliding window applications (which can be seen
as a subset of image processing), the FPGA turns out to be one to two orders of mag-
nitude more power efficient in terms of energy usage per frame than both the GPU
and CPU. Only in the case of a linear 2D convolution where the filtering operation
may be executed in the frequency domain as well, the GPU-FFT implementation
was able to obtain a power efficiency comparable to that of the FPGA. While both of
these devices are around a decade old, similar conclusions were drawn by [21] for
k-means clustering on more modern hardware. Here, several Xilinx Zynq FPGAs
achieved an ’FPS per Watt’ value 10 to 25 times better than the NVIDIA
GTX 980 GPU. The general trend is that GPUs can often process data faster than
FPGAs in terms of frames per second, but do so much less energy-efficiently.
2.2.2 System-on-Chip
A System-on-Chip (SoC) integrates a Processing System (PS) containing the CPU
with Programmable Logic (PL) representing the FPGA onto a single device [22].
Figure 2.8 depicts the block diagram of a Xilinx Zynq-7000 SoC, where the following
relevant functional blocks can be distinguished:
• Application Processor Unit (APU): The software part of the SoC, consisting of
a dual-core ARM Cortex-A9 CPU. It is used to control the full execution of
KinectFusion and to initiate data transfers via the AXI¹ Direct Memory Access
(DMA) IP core.
• Programmable Logic (PL): The reconfigurable hardware part of the SoC, de-
rived from the Xilinx Artix-7 FPGA. It is used to run the accelerated kernels of
KinectFusion.
• General-Purpose (AXI_GP) Ports: Provide PS-PL communication with two
32-bit master and two 32-bit slave interfaces. They are used to control the IP
cores, and their maximum estimated throughput is 600 MB/s [22].
• High-Performance (AXI_HP) Ports: Provide PS-PL communication with four
32- or 64-bit independently programmed master interfaces. They are used to
transfer large amounts of data via the DMA with an estimated maximum
throughput of around 1,200 MB/s [22], [24], [25].
• Central Interconnect: Connects the PL via its AXI_GP ports to the DDR mem-
ory controller, PS cache and I/O peripherals.
• DDR Controller (Memory Interfaces): Supports DDR2 and DDR3 for access to
the Dynamic RAM (DRAM), not shown on the figure.
• Programmable Logic to Memory Interconnect: Connects the PL via its AXI_HP
ports directly to the DDR controller for fast streaming (reading and writing) of
data.
The Zedboard, shown in Figure 2.9, is a low-end development board based on
the Xilinx Zynq-7000 SoC [26]. It will be used in this dissertation for the prac-
tical evaluation of several accelerated KinectFusion configurations. High-end FP-
GAs were considered as well, but were eventually deemed out of scope; we mostly
wanted to see how much the Zedboard alone is already capable of, as the Zynq FPGA is
relatively popular in image and video processing applications [27]. Furthermore,
the lower the cost of the hardware platform, the wider the range of devices and use
cases our work could be applied to.
2.2.3 High-Level Synthesis
High-Level Synthesis (HLS) represents a collection of processes that automatically
convert a high-level algorithmic description of a certain desired behavior to a circuit
specification in HDL that performs the same operation [4]. It allows the hardware
functionality of an FPGA to be specified directly by algorithms written in a software
programming language such as C or C++. HLS tools attempt to reduce time-to-
market and address the design of increasingly complex systems by permitting de-
signers to work at a higher level of abstraction. Design spaces can be explored more
rapidly this way, which is especially important when many alternative configurations
have to be implemented, generated and compared.

1 Advanced eXtensible Interface, a set of protocols for inter-IP communication as adopted and described by Xilinx in [23].

FIGURE 2.8: Functional block diagram of the Zynq-7000 SoC [22].

FIGURE 2.9: Annotated photograph of the Avnet Zedboard (adapted from [28]).
The task of automatically generating hardware from software is far from easy,
and a one-size-fits-all solution might not even exist in the same way that a fully
optimizing and/or parallelizing compiler is theoretically impossible to create. Nev-
ertheless, a wide range of different approaches exist that attempt to partially solve
the problem. Removing the burden on the user of having to reinvent the wheel is
already a great practical advantage of HLS. After all, frequently recurring concepts
such as pipelining, array partitioning and more are often already built into these
tools, ready to be used without requiring the designer to deal with their low-level
details.
Xilinx’ Vivado HLS is able to synthesize a procedural description written in C,
C++ or SystemC into a hardware IP block [1], [4]. Loop unrolling, pipelining, chain-
ing of operations, resource allocation and internal array restructuring are among the
many different optimizations that can be applied during the compilation process. In
addition, support for many types of interfaces such as shared memory and stream-
ing is built-in.
2.2.4 Designer workflow
With the availability of an Avnet Zedboard containing the Xilinx Zynq-7020 PL, the
author’s toolchain of choice consists of Vivado HLS, Vivado and Xilinx Software
Development Kit (SDK) v2017.4. These three development environments together
provide an integrated design flow as follows:
• In Vivado HLS [1]:
1. Write a C/C++ function to be integrated into the hardware system. This
can be written from scratch or based on an existing reference implemen-
tation. Data type selection and interface specifications have to be consid-
ered as well.
2. Write test benches; compile, simulate and debug the algorithm to verify
its functional correctness. Return to step 1 until the output is correct.
3. Optimize the C/C++ code to make it tailored towards a useful imple-
mentation on the FPGA. One important practice here is the application of
Vivado HLS optimization directives, which automate many repetitive
aspects of the optimization process. Ensure that the algorithm stays cor-
rect by performing step 2 as needed.
4. Synthesize the top function into an RTL implementation. Vivado HLS
creates two variants: VHDL and Verilog, both of which ought to be fully
equivalent.
5. Analyze the reports and cycle-by-cycle computation steps of the resulting
design. Return to step 3 until satisfied. This back and forth process is part
of Design Space Exploration (DSE).
6. Optionally, verify the correctness of the RTL implementation by running
a C/RTL cosimulation.
7. Export the RTL implementation to package it into an IP block, ready to be
used in subsequent design tools.
• In Vivado [29]–[31]:
8. Create a new IP integrator block design and insert the ZYNQ7 Processing
System. This block encompasses the embedded PS functionality, while
all other IP cores around it represent what will be implemented on the
FPGA.
9. Configure the Zynq-7000 with respect to clock speeds, PS-PL communi-
cation, peripheral interfaces and more.
10. Insert your custom HLS IP core(s) into the block design while taking care
of AXI interfacing, interconnects and ports. Optionally, insert an AXI
DMA IP core which provides streaming to memory-mapped conversions
and vice versa, in order to allow streaming IP cores to efficiently access
the DRAM via HP ports.
11. Insert a System Integrated Logic Analyzer (ILA) IP core, and add de-
bugging probes to important signals. This allows for the debugging of
post-implementation designs on the FPGA device, which consumes extra
resources but no additional clock cycles.
12. Verify the design, and fix the block design if needed.
13. Perform logic synthesis and implementation. Redo step 9 if problems
arise, such as the critical path length exceeding the clock period.
14. Analyze the resource usage and timing profile of the implemented design.
If the resource usage exceeds the Zynq-7000’s maximum, consider reduc-
ing the complexity of the HLS IP core(s) and/or decreasing the number
of concurrent IP cores present in the block design at step 10. If the critical
path exceeds the clock period, return to step 9 until any timing issues are
resolved.
15. Generate the bitstream and export the hardware to Xilinx SDK.
• In Xilinx SDK [32], [33]:
16. Create a new standalone Board Support Package (BSP) based on the pre-
viously generated hardware platform. The drivers of this BSP allow hard-
ware components to be called directly from software code.
17. Create a new C++ software application project based on the BSP. The au-
thor recommends importing the Hello World example, as it contains the
necessary platform initialization and clean-up routines.
18. Configure the BSP project to include relevant libraries, and modify the
software project’s linker script to ensure the stack and heap sizes are large
enough for your use case.
19. Write C/C++ application code that executes and verifies the full system’s
functionality. Software can be debugged by setting breakpoints in SDK,
while hardware logic can be debugged using the System ILA in Vivado
as described in [31].
It is possible that the resolution of some problems in the last step extends all the
way back to step 1, in the sense that it requires a comprehensive and holistic analysis
of all aspects in the design process. One such incident might be that the integrated
system’s performance is less than expected, so that architectural decisions regarding
bandwidths, data types etc. have to be revisited. At other times, more fundamental
limitations might be encountered, such as excessive resource utilization (making
routing impossible) or I/O bandwidth ceilings, which usually cannot be solved
without reassessing the initial specifications of the system.
2.3 SLAM on FPGAs
In recent years, the idea of implementing Simultaneous Localization and Mapping
on FPGAs has been explored in diverse ways. Works in literature range from two-
dimensional [34], [35] to three-dimensional [3], [7], [36]–[40] SLAM and from sparse
[7], [35], [38], [39] to semi-dense [3] SLAM, although a fully dense 3D variant such
as KinectFusion seems to be less popular in the embedded hardware community.
Furthermore, most works focus on the hardware acceleration of just specific parts
belonging to a whole heterogeneous SLAM system [36], [37], [39]. The selected
subcomponents naturally include those that the FPGA is known to be strong at in
terms of performance and efficiency. Nevertheless, the following text provides an
overview of these existing results in an attempt to pick up architectural and method-
ological clues for the as-complete-as-possible acceleration of SLAM on FPGAs.
In order to give an idea of the current state of the art, Table 2.2 summarizes most
recent works on 3D SLAM. Some authors have published improvements of their
systems over several years, in which case only the latest results are shown. A clear
Reference        Algorithm      Type        Platform(s)                    Speed      Year
[7], [41]        FastSLAM 2.0   Sparse      Host CPU + Arria 10 FPGA       102 FPS    2018
[36]             ORB-SLAM       Sparse      Host CPU + Stratix V FPGA      67 FPS     2018
[38]             VO-SLAM        Sparse      DE3-340 SoC                    31 FPS     2015
[39]             EKF-SLAM       Sparse      Zynq-7020 SoC                  30 Hz      2015
[3], [42], [43]  LSD-SLAM       Semi-dense  Zynq-7045 SoC                  >60 FPS    2019
[37]             KinectFusion   Dense       GTX 760 GPU + Stratix V FPGA   26-28 FPS  2015
[40]             ICP-SLAM       Dense       Zynq-7020 SoC                  2 FPS      2017

TABLE 2.2: A compilation of recent 3D SLAM applications involving the FPGA taking up roles of varying importance, showing a trend of decreasing frame rate with increasing "density". SoC (System-on-Chip) boards always contain both an embedded CPU and an FPGA.
trade-off is visible between the performance and quality of the algorithm, in ad-
dition to the role played by the actual ’embeddedness’ of the heterogeneous plat-
forms in use. Low-power, real-time sparse SLAM applications seem to be coming of
age thanks to their light computational weight, although they do not produce a us-
able map and often lack loop-closure functionality [3]. On the other hand, the more
accurate and feature-rich fully dense solutions notably require high-end hardware
(typically desktop GPUs as shown in Figure 2.1) unless the real-time constraint is
disposed of [2].
2.3.1 Dense and semi-dense SLAM
Perhaps the most relevant entry in Table 2.2, the implementation of real-time 3D re-
construction using KinectFusion on a heterogeneous system with an FPGA has been
attempted before by [37]. Here, Gautier et al. set out to accelerate two intensive
parts of the application, specifically the Iterative Closest Point (ICP) algorithm and
the volumetric integration step, corresponding to the track, reduce and integrate ker-
nels in Figure 2.4. Their set-up was a heterogeneous system with the Altera Stratix
V FPGA and the NVIDIA GTX 760 GPU, both connected via PCI Express to a host
computer. It is interesting to note that the authors’ goal of accelerating integrate was
unsuccessful due to a fundamental bandwidth limitation. A 3D volume with 512³
elements takes 512 MB of space in memory, which is far too large to transfer back
and forth between the FPGA and the GPU (or CPU, for that matter) at sufficient
speeds. Their final architecture therefore consists of KinectFusion being executed
nearly completely on the GPU, but with the ICP part of the tracking step offloaded
to the FPGA. Real-time speeds of up to 28 FPS were achieved by halving the input
data resolution and limiting the number of ICP iterations. Lastly, Gautier et al. point
out that the Altera OpenCL SDK posed practical difficulties in optimizing area uti-
lization, for example because the tool lacked support for fixed-point arithmetic.
The fact that semi-dense and dense SLAM algorithms are characterized by high
bandwidth requirements was also noted by [42] in 2016. In this work, Boikos et
al. presented their first iteration of LSD-SLAM, achieving 4 FPS at a resolution of
320x240. Two accelerator units were implemented, and the communication between
them had to occur via DDR because the intermediate data (on the order of a few
MB) produced by the first unit could not be cached entirely on the FPGA. A re-
designed architecture around the tracking core to enable the usage of a full stream-
ing communication paradigm was presented in [43], bringing a five-fold frame rate
improvement over the previous work. Combined with a scalable depth estimation
architecture again by Boikos et al. in [3], 2019 marked the arrival of the first com-
plete accelerator for direct semi-dense monocular SLAM on FPGAs. The power con-
sumption was measured to be an order of magnitude smaller than that of an Intel
i7-4770 CPU. A highly important takeaway is that the dataflow principle (i.e. ker-
nels linked with a single-consumer, single-producer pattern) was found to yield the
most efficient design. Furthermore, the units were made multi-modal in order to
deal with LSD-SLAM’s complex control flow due to the iterative and multi-level na-
ture of tracking. More specifically, the pipelined hardware blocks could be put to
different uses depending on the current phase of the system: every unit contains a
set of operations from which the desired computation can be selected by means of
multiplexing. As will be explained in Chapter 5, similar techniques were employed
in our research for the dataflow architecture of KinectFusion.
Chapter 3

High-level synthesis design of individual kernels
The transformation of KinectFusion’s source code into Vivado HLS-optimized code
is an important aspect of this thesis, not just to obtain an HLS implementation of
SLAM but also in an attempt to recognize patterns in the approach by which it is
done. Many of the subroutines in a dense 3D SLAM algorithm also recur in other
applications related to computer vision and image or video processing, because the
data management structure of such kernels only varies so much in practice. Conse-
quently, the task of accelerating these kernels can largely be mapped to a framework
that we developed for the purpose of making the design of similar HLS kernels eas-
ier in the future. Before proceeding to this methodology in Section 3.2, some details
about the specific use case are discussed first so as to give a better idea about the
characteristics and diversity of the kernels being dealt with.
3.1 Prerequisites
3.1.1 Detailed algorithm description
KinectFusion, like any SLAM algorithm, is composed of various steps that have to be
executed in succession for each captured depth frame. A diagram of KinectFusion’s
nine essential kernels is depicted in Figure 3.2. Note that some routines are called a
variable number of times per frame, and that multiple instances of each kernel might
be called with mutually different sets of dimensions. This complex interaction and
dataflow will be discussed in detail in Chapter 5; a functional description of each
kernel follows here.
• mm2m_sample: This kernel essentially resamples the sensor output and per-
forms a unit conversion from millimeters to meters. The raw depth frames
captured by the Kinect camera are given in an unsigned short integer format,
which need to be converted to a floating point representation and resized (if
necessary) to the correct dimension by subsampling the pixels. This allows all
subsequent kernels to work with distance values expressed in meters only.
• bilateral_filter: Because the Kinect depth map is rather noisy [44], the data is
first filtered in an edge-preserving way. This kernel is a non-linear filter that
replaces each depth value by a weighted average of nearby values in order to
reduce the noise amplitude. Its smoothing operation relies on the prior knowl-
edge that many real-world environments consist of large patches of mostly flat
areas, such as shown in Figure 3.1. The algorithm clearly preserves discontinu-
ities, which is achieved by just barely weighing intensities that are ’far away’
from the currently considered intensity in an absolute value sense. Newcombe
et al. [5] have found that the insertion of this preprocessing stage greatly
increases the quality of the normal maps produced in later stages of the al-
gorithm, which in turn improves the data association step performed by the
tracking kernels.
• half_sample: This kernel resamples the bilaterally filtered depth map by a
factor of two in each dimension. Every four input values are thus mapped to
one output value, again in an edge-preserving manner so as not to introduce any
fake averaged depth values near discontinuities.
• depth2vertex: This kernel computes a point cloud in projective form, i.e. with
every vertex as a function of its pixel position. By multiplying every depth
value with a matrix that summarizes the camera’s intrinsics [45], an output
array is produced where each element represents a Euclidean 3D point in the
local frame of reference.
• vertex2normal: Given the map of vertices generated previously, this kernel
produces an array of normal vectors. Every normal vector is calculated by
taking the cross product of two subtracted pairs of neighboring points.
• track: This kernel performs part of the multi-scale Iterative Closest Point (ICP)
alignment. It essentially tracks the live depth frame against the globally fused
model, in order to establish correspondences between the new and synthetic
point clouds. No feature selection occurs as KinectFusion is a fully dense al-
gorithm. A faster-than-conventional variant of ICP is employed by [5] as well as
by the source code discussed later. This optimization is made possible by the as-
sumption of small motion from one frame to the next, which holds as long as the
frame rate remains sufficiently high.
• reduce: This kernel calculates the total error of the tracking result, by adding
up distances between corresponding points in the input and predicted point
clouds. 32 values are obtained to form a basis for the error minimization pro-
cess.
• update_pose: This routine produces a new or refined pose estimation, starting
from the reduction computed above.
FIGURE 3.1: Illustration of the bilateral filter, showing its edge-preserving property [46].
• check_pose: Since the correction of the pose estimate should only be applied
when the resulting tracking error is small enough, this routine verifies the re-
duction output and resets the pose to its last sufficiently stable estimate if nec-
essary.
• integrate: Once the estimated pose has been updated, this kernel integrates
the newly observed depth map into the global 3D map. This volume consists
of Truncated Signed Distance Function (TSDF) values, whereby each element
has an associated weight corresponding to the certainty of the surface mea-
surement at that position. The integration step transforms the input depth
map into a world frame of reference, and iterates over the whole volume to
update each element by computing a simple running average of the existing
TSDF value and the (possibly noisy) new TSDF value. In order to maximize
the system’s ability to reconstruct finer scale structures, the raw depth map is
used for this purpose rather than the bilaterally filtered version [5].
• raycast: This kernel generates a synthetic vertex and normal map by casting
a ray from every pixel into the fully up-to-date dense 3D volume, given the
current pose estimate. It therefore generates a reliable prediction of what the
corresponding input arrays should look like if the camera would make its ob-
servation at the specified position. This step is essential in forming a reference
that allows the tracking kernels to take into account all previous observations
made so far.
The reduce kernel is followed by update_pose and check_pose, but these are left out
of the diagram because they contain ’irregular code’ that is inherently unfit for the
FPGA. The justification for this arises from the absence of any significant oppor-
tunity for parallelization, pipelining, or temporally efficient hardware sharing. In
update_pose, a vector of 6 values is computed by means of singular value decomposi-
tion, after which the updated pose matrix is generated by calculating an exponential
map from a Lie algebra to the group of rigid transformations in 3D space [47]. Imple-
menting these operations on the FPGA takes up a large fraction (around one third)
FIGURE 3.2: Overview of KinectFusion kernels. The depth input flows through mm2m_sample, bilateral_filter, half_sample, depth2vertex and vertex2normal (at three pyramid levels), then through track, reduce and update_pose, followed by integrate and raycast, producing the map and pose outputs. Green shaded areas include blocks that are executed multiple times per frame and per level; once for every iteration i.
of the Zynq’s resources. This indicates that the hardware blocks are being used in-
efficiently, as the code is irregular. Only if the matrix dimensions were much larger
or the routine were called much more often than it is right now would we be able
to exploit repetitiveness and consider FPGA acceleration. The second
method, check_pose, is more related to control flow than actual computation, making
the overhead of off-loading this method to the FPGA undesirable. Lastly, both rou-
tines take relatively little processing time. All of the discussed factors have led the
author to the decision to execute update_pose and check_pose on the ARM Cortex-A9
CPU only, regardless of which other kernels are being off-loaded to the FPGA.
3.1.2 Source code, dataset and parameters
Reference implementation
A C++ implementation of KinectFusion is provided by SLAMBench [6], [9], [48],
[49], which is in turn based on the CUDA implementation by [50]. We have decided
to completely rewrite KinectFusion based on the source code found in these GitHub
repositories, for the following reasons. First, out-of-scope or platform-specific fea-
tures such as comprehensive benchmarking, multi-core functionality, graphics ren-
dering1, user interfaces, extensive I/O support and more can easily be omitted this
way to avoid interference with relevant features and improvements. Second, by fix-
ing loop bounds as much as possible, the methods become better suited for FPGA
acceleration. Variable loop bounds are difficult to reconcile with certain HLS opti-
mizations such as loop unrolling [1]. Moreover, the loop control can be simplified
if the HLS compiler knows the number of iterations beforehand, thus saving on re-
sources. Third, library dependencies and C++11 features are avoided as much as
possible, in order to increase the code portability. It has been experimentally de-
termined that Vivado HLS and Xilinx SDK do not fully support C++11-specific fea-
tures, so this step allows for the ARM-compilation and execution of KinectFusion
kernels on the Zedboard without issues. TooN [51] is the only remaining external
library, and is used for the complex linear algebra calculations performed in
update_pose. Coincidentally, this library does not need to be ported to FPGA hardware
as update_pose was already declared ineligible for acceleration. Lastly, the existence
of a clean reference implementation allows for a more straightforward comparison
between generic (CPU) and FPGA-specific variants of the kernels. Henceforth, the
terms ’original’ and ’reference implementation’ will always refer to respectively [9]
and the rewritten version of KinectFusion as described in this paragraph.
1 After all, visualization is just one of the possible use cases of SLAM, as previously noted in Chapter 2. The removal of this feature makes our implementation less opinionated, in the sense that it allows for any other application to substitute for it instead.
FIGURE 3.3: Screenshot of the SLAMBench2 GUI when evaluating the ’Living Room 2’ scene.
Dataset choice and data extraction
There is little difference in KinectFusion’s performance or accuracy across different
datasets. In order to measure kernel timings in the reference implementation, the
author has rather arbitrarily chosen the ’Living Room 2’ RGB-D video fragment be-
longing to the ICL-NUIM dataset [45]. This scene has a resolution of 640x480 which
corresponds to what the Kinect v1 sensor captures, and is even slightly higher than
the depth map resolution of the Kinect v2 camera (512x424) [52]. An example of
SLAMBench’s visualization is shown in Figure 3.3. The original implementation was mod-
ified so that intermediate data could be extracted from in-between every block in
Figure 3.2, with the goal of thoroughly testing every HLS kernel’s accuracy relative
to a reliable ground truth. Using test benches in Vivado HLS, the exact deviations
caused by optimizations (such as moving from floating point to fixed-point arith-
metic) or other changes were evaluated precisely this way.
Parameter space exploration
While the specific dimensions and parameters of our system do not matter much in
the construction of a generally applicable methodology, it is useful to get an idea of
the orders of magnitude that are being considered. The important factors influencing
this decision include accuracy, speed and usefulness of the resulting application. Us-
ing SLAMBench [6], the mean ATE of KinectFusion’s original implementation was
calculated for 64 parameter configurations by varying the following four settings:
• Input resampling factor (= 1, 2, 4 or 8): Specifies the resizing ratio of the cap-
tured sensor frame. If it is larger than one, then mm2m_sample will perform re-
sampling in order to downscale the input depth map from 640x480 to 320x240,
160x120 or 80x60.
• Global volume resolution (= 256, 128, 64 or 32): The size of the reconstruction
array along one dimension. The upper boundary of 256³ corresponds to roughly
16.8 million weighted TSDF elements in total.
• Tracking rate (= 1 or 2): The frame interval by which to perform the pose track-
ing step. If the input is captured at 30 FPS but the tracking rate is halved (i.e. its
value equals 2), then the pose would only be updated once every second frame.
Consequently, the actual processing rate of the algorithm would be limited to
15 FPS in real-time as half of the input frames effectively remain unused.
• Integration rate (= 1 or 2): The frame interval by which to perform the volu-
metric integration step. As it makes no sense to perform integration with an
outdated pose, this value must be equal to or higher than the tracking rate.
It was quickly found that setting the volume resolution to 64 or lower almost
always leads to loss of track, effectively rendering the system useless at perform-
ing localization or mapping. Excluding these values then results in Figure 3.4. Re-
markably, the algorithm performs well at all depth map resolutions. Halving the
volume resolution doubles the mean ATE on average, although the input resam-
pling factor does not have a significant influence as long as its value does not exceed
4. For reasons that we could not explain, the ATE for one of the 16 configurations
(640x480, 256³, 15 FPS) skyrocketed during the benchmark. This phenomenon was
also present in ’Living Room 0’, and we speculate that the algorithm simply experi-
ences bad luck for this particular configuration. Errors tend to grow near the end of
the scene, although the first half often works flawlessly.
Note that the meaning of FPS in this context is unrelated to the actual perfor-
mance of the system that runs the algorithm. Instead, it designates the rate at which
the sensor provides its input frames, or in other words, the frequency at which we
decide to skip these frames. Reducing the input FPS from 30 to 15 lowers the bar
(i.e. the system’s minimal workload) of achieving real-time performance by a factor
of two, while sacrificing just a small amount of accuracy. If the heterogeneous sys-
tem is able to process frames at 15 FPS or higher, then we can talk about real-time
performance as long as every second sensor frame is disposed of. [5] have shown
that using only every 6th sensor frame may still result in reconstructions of lesser
but nonetheless acceptable quality, a concept also known as graceful degradation.
The real-world size of the volume is 8 × 8 × 8 m³, yielding a smallest step of
3 cm along one dimension in the case of a volume with 256³ elements. For use
cases that require the ability to distinguish objects in sufficient detail, the author
declares a lower accuracy than that unacceptable (see Figure 3.5 for an example with
FIGURE 3.4: Mean ATE for different configurations of KinectFusion. The cubed numbers indicate volume resolutions, while the input FPS corresponds to both the tracking and integration rate.
FIGURE 3.5: A) RGB video stream (unused). B) Latest depth map captured by the Kinect sensor. C) Reconstructed scene using KinectFusion [37].
a resolution of 512), so the volume resolution is kept at 256. However, the input
resampling factor is set to 2 as both the mean ATE and the global reconstruction
quality are still satisfactory. Furthermore, the decrease from 30 to 15 FPS also has
negligible effect on the accuracy but obviously a considerable positive effect on the
speed of the system. The final selected configuration is as follows: an input size of
320x240, a volume resolution of 256, a tracking rate of 2 and an integration rate of
2. In theory, real-time performance can be achieved as soon as the actual processing
rate becomes 15 FPS or more.
CPU timing results
With the parameters determined above, the reference implementation of KinectFu-
sion was timed on both a laptop computer and the CPU of the Zynq SoC with com-
piler optimizations enabled. Table 3.1 summarizes the total time spent within each
Kernel            Intel i7-6700HQ CPU   ARM Cortex-A9 CPU
mm2m_sample       0.4 ms                2.7 ms
bilateral_filter  58.2 ms               425.9 ms
half_sample       0.6 ms                1.8 ms
depth2vertex      3.0 ms                7.9 ms
vertex2normal     7.8 ms                27.4 ms
track             20.9 ms               125.7 ms
reduce            35.1 ms               27.7 ms
integrate         205.0 ms              1236.1 ms
raycast           233.0 ms              1293.6 ms
Total speed       1.77 FPS              0.32 FPS

TABLE 3.1: Time spent in each kernel when KinectFusion is executed on the CPU of either a regular laptop or the Avnet Zedboard. The resulting frame rate is determined by summing up all timings on a given platform.
kernel. No multi-core functionality was used, because all software is executed as
standalone (bare metal) on the Zedboard. Without an operating system, multi-
threading is too cumbersome to implement and is therefore left out of scope for this
dissertation.
3.2 Methodology
3.2.1 Common parallel patterns and categorization
When presented with the question of how to accelerate image processing kernels,
one could answer that it is important to first understand which type of kernel is be-
ing considered. In the domain of parallel programming, a very useful categoriza-
tion happens on the basis of the kernel’s parallel pattern. This concept assigns every
method to a kind of predefined algorithmic skeleton, based on its data management
and/or computation pattern [53].2 The design methodology can then be tailored to-
wards the optimization of every category separately, which allows the kernels to be
tackled more easily after performing this subdivision. As will become clear, every
pattern has its own set of challenges and approaches during its implementation;
this holds true not just for GPUs but for FPGAs and other hardware as well.
As a side note on terminology, the following paragraphs will often speak of images and pixels when referring to (two-dimensional) arrays and elementary data values, respectively. The reader should keep in mind that the elements can be of any data type and are not necessarily limited to regular pixels. Many patterns, as well as their discussed methods, can even be generalized to three-dimensional data structures. The aforementioned terms will still be used for

²As such, the patterns are not really parallel in the sense that the FPGA will not parallelize them the way a GPU does, but we still call them that for convenience.
32 Chapter 3. High-level synthesis design of individual kernels
FIGURE 3.6: The Map pattern [9].
    for (int y = 0; y < HEIGHT; y++) {
        for (int x = 0; x < WIDTH; x++) {
            // Read an input value
            data_t value_in = data_in[y][x];
            // Calculate some function of (value_in, x, y) to produce value_out
            data_t value_out = f(value_in, x, y);
            // Write the output value
            data_out[y][x] = value_out;
        }
    }
LISTING 3.1: Code snippet representing the Map parallel pattern.
conciseness and simplicity. Next, the pseudo-code examples serve to illustrate the
computation pattern of each type, and do not yet take into account any efficiency
and/or HLS optimization measures. Lastly, the approach towards FPGA accelera-
tion is already summarized very briefly for each category, but will be explained in
full detail in the next section.
Map
The Map parallel pattern is arguably the simplest pattern with regard to image processing. As illustrated in Figure 3.6, this type of kernel independently maps every
input pixel to a corresponding output pixel, without using the value of any other
input or output pixel in its calculation. The pseudo-code for this pattern can be
summarized as in the code snippet in Listing 3.1.
The characteristic aspect of this category is that, for every iteration, the calcu-
lation depends only on a single input pixel that is never used again in any other
iteration. This means that the input data is scanned in a linear manner, i.e. from
the beginning to the end. This fact can be readily exploited in FPGA-oriented opti-
mization stages, both computation- and communication-wise. Example applications
of the Map category include affine transformations such as pixel inversions or unit
conversions.
This pattern is a perfect candidate for pipelining, since all iterations are indepen-
dent of each other in terms of data. In HLS, a streaming interface should be applied
to the kernel in order to allow an AXI DMA to send and receive data at high speeds.
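To make this concrete, the sketch below shows what a Map kernel could look like in HLS-style C++. The function name mm2m_map, the data types and the image size are our own illustrative choices (loosely inspired by the mm2m_sample kernel from Table 3.1), and a real streaming design would use hls::stream ports rather than plain arrays; the pragma is a Vivado HLS hint that an ordinary compiler ignores.

```cpp
#include <cassert>
#include <cstdint>

#define MAP_HEIGHT 240
#define MAP_WIDTH  320

// Hypothetical Map kernel: every output pixel depends on exactly one input
// pixel, so the innermost loop can be pipelined with an initiation
// interval of one.
void mm2m_map(const uint16_t depth_mm[MAP_HEIGHT][MAP_WIDTH],
              float depth_m[MAP_HEIGHT][MAP_WIDTH]) {
    for (int y = 0; y < MAP_HEIGHT; y++) {
        for (int x = 0; x < MAP_WIDTH; x++) {
#pragma HLS PIPELINE II=1
            // One read, one independent computation, one write per iteration.
            depth_m[y][x] = depth_mm[y][x] * 0.001f;
        }
    }
}
```

Because no iteration reuses data from another, the input can be scanned strictly linearly, which is exactly what a streaming interface requires.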
FIGURE 3.7: The Stencil pattern [9].
    #define WIN_SIZE (HALF_SIZE * 2 + 1) // WLOG: always odd
    for (int y = 0; y < HEIGHT; y++) {
        for (int x = 0; x < WIDTH; x++) {
            // Read multiple input values
            data_t window[WIN_SIZE][WIN_SIZE];
            for (int i = -HALF_SIZE; i <= HALF_SIZE; i++) {
                for (int j = -HALF_SIZE; j <= HALF_SIZE; j++) {
                    if (0 <= y + i && y + i < HEIGHT && 0 <= x + j && x + j < WIDTH)
                        window[i + HALF_SIZE][j + HALF_SIZE] = data_in[y + i][x + j];
                }
            }
            // Calculate some function of (window, y, x) to produce value_out
            data_t value_out = f(window, y, x);
            // Write the output value
            data_out[y][x] = value_out;
        }
    }
LISTING 3.2: Code snippet representing the Stencil parallel pattern.
Loop unrolling should be no problem either, if permitted by the FPGA resources and
PS-PL bandwidth capacity.
Stencil
The Stencil parallel pattern shown in Figure 3.7 is similar to Map, but every output
pixel now depends on a set of multiple neighboring input pixels instead of just a sin-
gle one. This dependence very often comes down to a predefined moving window,
in which case the output calculation depends only on input pixels within this win-
dow. Listing 3.2 shows a possible pseudo-code representation, in which the sliding
window is fully loaded with input data at the start of every new iteration.
Note that in the given example code, border handling effects are left to the user.
They should always be taken care of adequately in order to prevent accidental reads
from invalid (out-of-bounds) memory addresses. This type of kernel is characterized
by the explicit reuse of input data; every pixel is usually read multiple times and is
involved in the calculation of multiple output pixels. The reuse factor often equals
the squared window dimension (WIN_SIZE²) for a standard convolution with nonzero coefficients, although this is not necessarily the case in general. Applications of
FIGURE 3.8: The Reduce pattern [9].
    reduction_t result = IDENTITY; // initialize with the aggregation's identity element
    for (int y = 0; y < HEIGHT; y++) {
        for (int x = 0; x < WIDTH; x++) {
            // Read an input value
            data_t value_in = data_in[y][x];
            // Update the temporary variable holding a partial aggregation
            result = aggregate(result, value_in);
        }
    }
    return result;
LISTING 3.3: Code snippet representing the Reduce parallel pattern.
the Stencil category include for example two-dimensional linear or non-linear filters.
Clearly, the access pattern to data_in is more complicated than for the Map pat-
tern, which makes the application of streaming interfaces non-trivial. Furthermore,
bottlenecks can arise very quickly upon employing the pipeline paradigm without
careful optimization. The technique to resolve this issue will be to insert line buffers
containing several rows of pixels, along with array partitioning directives in order
to ensure the possibility of concurrent accesses to the window and line buffers while
maintaining a minimized initiation interval (II).
Reduce
The Reduce parallel pattern (Figure 3.8) is a many-to-one operator and aggregates all
input elements into a single output value. This kernel type is typically programmed
as in Listing 3.3.
The lack of an output image and the combining aspect are characteristic of this kernel. Naturally, multiple output values (which can be seen as one large data type)
are also allowed. However, the output size essentially remains constant and does
not scale with the input size. FPGA-wise, the acceleration of this pattern is very
similar to Map. The biggest difference is the absence of an output stream: instead,
the result will be exposed as a simple memory-mapped register for the PS to read
from upon completion.
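One FPGA-specific subtlety, not visible in the pseudo-code, is that the loop-carried dependency on result can prevent an initiation interval of one when the aggregation has a multi-cycle latency (floating-point addition, for instance). A common remedy is to interleave several partial accumulators and combine them once at the end; the sketch below illustrates the idea with invented names, and the parameter P is a tuning choice rather than anything prescribed by the tool.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical pipelined Reduce: consecutive iterations update different
// partial accumulators, stretching the dependency distance from one
// iteration to P iterations so the adder pipeline can stay full.
template <std::size_t P>
float reduce_sum(const float *data, std::size_t count) {
    float partial[P] = {0.0f};
#pragma HLS ARRAY_PARTITION variable=partial complete
    for (std::size_t i = 0; i < count; i++) {
#pragma HLS PIPELINE II=1
        partial[i % P] += data[i];
    }
    // Combine the partial sums once, outside the main loop.
    float result = 0.0f;
    for (std::size_t p = 0; p < P; p++)
        result += partial[p];
    return result;
}
```

In hardware, P would typically be chosen equal to (or larger than) the latency of one addition, so that a new input can be accepted every clock cycle.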
FIGURE 3.9: The Gather (or Scatter) pattern [9].
    for (int y = 0; y < HEIGHT; y++) {
        for (int x = 0; x < WIDTH; x++) {
            // Retrieve the input position (y_in, x_in)
            int y_in = f(y, x);
            int x_in = g(y, x);
            // Read the input value
            data_t value_in = data_in[y_in][x_in];
            // Calculate some function of (value_in, y, x, y_in, x_in) to produce value_out
            data_t value_out = h(value_in, y, x, y_in, x_in);
            // Write the output value
            data_out[y][x] = value_out;
        }
    }
LISTING 3.4: Code snippet representing the Gather parallel pattern.
Gather
The Gather parallel pattern, shown in Figure 3.9, introduces a complexity not seen so far: random data access. Instead of the input and output arrays both being scanned regularly, they are accessed at irregular indices. These positions can depend
on a limited number of environmental parameters, but also on the data itself. Listing
3.4 shows an example whereby the reads from the input image occur in an irregular
fashion, while the output image gets written to in a linear (and thus fully predictable)
manner.
The characteristic aspect of this category is the presence of 'random' memory
accesses (it does not matter whether this happens at the input side or the output
side). Here, the term random does not necessarily involve statistical randomness, but
also includes irregular, parameter-dependent and/or content-dependent accesses
that cannot be fully predicted by the programmer without knowing more about the
context during execution. This type of kernel is often found in combination with
the Map pattern, whereby an auxiliary or reference array is the one that gets ac-
cessed at irregular positions in order to retrieve extra data, while the main input
and output arrays are still mapped in a linear order. Although the Stencil and Gather categories concern different patterns, they sometimes overlap, for example in the tracking step of KinectFusion: the irregularly accessed positions might not be uniformly random, but instead remain somewhat confined (either deterministically or statistically) within an area around the 2D position of the corresponding 'linear' loop determined by (y, x). Application-specific knowledge and analysis of spatial locality will be very useful in determining which possible solution to pick.

FIGURE 3.10: The Search pattern [9].
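To make the access pattern concrete, the following self-contained sketch (all names are our own, hypothetical choices) gathers input pixels through an index map, so the read positions are only known once the maps' content is inspected:

```cpp
#include <cassert>
#include <cstdint>

#define GH 4
#define GW 4

// Hypothetical Gather kernel: the output is traversed linearly, but each
// read from data_in happens at a position taken from two index maps,
// i.e. a content-dependent ('random') access.
void gather_remap(const int16_t data_in[GH][GW],
                  const uint8_t idx_y[GH][GW],
                  const uint8_t idx_x[GH][GW],
                  int16_t data_out[GH][GW]) {
    for (int y = 0; y < GH; y++) {
        for (int x = 0; x < GW; x++) {
            int y_in = idx_y[y][x];   // irregular input position
            int x_in = idx_x[y][x];
            data_out[y][x] = data_in[y_in][x_in];
        }
    }
}
```

For example, filling the maps with idx_y[y][x] = x and idx_x[y][x] = y turns the kernel into a transpose; on an FPGA, every such external read would stall unless the accessed region is buffered locally, which is the problem Section 3.2.4 addresses.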
In general, the Gather pattern is poorly suited to FPGA acceleration due to its irregular memory accesses. After all, local buffering of all image data is often infeasible given the limited amount of Block RAM (on the order of several megabits) present on an average FPGA. One technique is to cache different parts of the data, each piece as large as possible, and re-execute the kernel as many times as needed to fully process all elements. Another technique is to treat the kernel as a Stencil pattern, and employ a fall-back procedure of external DDR requests if a specific position happens to fall outside the current line buffer. Clearly, however, both techniques are only applicable in limited cases, which will be discussed in Section 3.2.4.
Other
Many remaining parallel patterns exist, such as Search in Figure 3.10 and Scatter in
Figure 3.9. The former type retrieves data based on content matching [53] and is
embodied by the raycast kernel in KinectFusion, albeit for a huge volume of data.
The latter category can be contrasted with Gather, as Scatter kernels write their output
data to random locations rather than reading from them. Due to a mixture of scope
constraints and lack of FPGA-relevant methods in the considered SLAM use case,
both patterns were not studied in detail in this dissertation. For the Search kernel,
we refer to a dynamic variant of the scratchpad buffering technique that is nomi-
nally tailored towards Gather, while a Scatter kernel could be treated as belonging
to the Map category since external memory writes are non-blocking operations (unless read-after-writes occur). For completeness, a pseudo-code example of Search is shown in Figure 3.11. A similar computation is performed for every 2D pixel in
raycast.
3.2.2 Pipelining
The concept of pipelining is of primary importance for the acceleration of compu-
tationally intensive kernels of any kind. This holds true especially in the world of
hardware design, because it allows for a much more time-efficient usage of all re-
sources required for the calculation. Pipelining can essentially be summarized as:
the ability to start executing the next iteration of a computation before the previous
    step_t t = 0;
    data_t val = 0;
    while (true) {
        // Retrieve the position at which to inspect data
        int y = f(t, val);
        int x = g(t, val);
        // Read the value
        val = data[y][x];
        // Check whether the condition we are looking for is satisfied
        if (found(t, x, y, val))
            return (t, x, y, val);
        // Increment t
        t += step_size(t, x, y, val);
    }
FIGURE 3.11: Non-exhaustive code snippet representing a possibleinstance of the Search parallel pattern.
one has finished [54]. It can be compared with a factory assembly line, whereby
parallelism that exists among the different steps is exploited in order to overlap the
execution of multiple iterations. Since any given loop body generally consists of
various operations that require distinct sets of resources, the synthesized design will
often be suboptimal as long as this opportunity for resource sharing is not taken
advantage of. The manner in which speed-ups can be obtained through pipelining
is visualized in Figure 3.12. Here, the initiation interval (II) is defined as the rate at
which a new iteration starts, measured in clock cycles. The iteration latency indicates
the actual duration of one such iteration, which may well be many times longer
than the initiation interval thanks to the opportunity for overlapped computations
provided by pipelining.
Even though the clock frequency of an FPGA is typically one order of magnitude lower than that of a CPU, it will produce higher throughputs for the great majority of kernels. Furthermore, the FPGA is inherently dataflow-oriented, making it ideal for a steady stream of input and output data that has to undergo a long string of mutations in a pipelined fashion. A useful analogy is to view the FPGA as a set of hardware blocks operating on the data elements flowing through them, whereas a CPU could be viewed as a small set of registers operated on by the instructions flowing through the CPU one by one. In short, the notion of pipelining
is intrinsically very well suited to the FPGA, and in general there exist few reasons
not to apply this concept to methods that contain repetitive calculations over large
bodies of data. An initiation interval of one clock cycle is the fastest possible, and is
often what the designer ought to aim for in practice.
(A) Execution on a CPU, which has limited opportunities for parallelization or higher-level pipelining.
(B) Execution on an FPGA with pipelining.
FIGURE 3.12: Concept of pipelining applied to a repeated calculationcalled ’op’ on a large array.
Practically, the loops contained within an HLS kernel can easily be pipelined
thanks to optimization directives that are readily available in Vivado HLS. By in-
serting a pragma in either the loop body’s source code or a separate configuration
file, the programmer can provide a hint to the HLS compiler that he or she wishes
to pipeline the respective loop. The compiler will then insert registers between mul-
tiple hardware operations and reschedule them as needed, so that no interference
occurs between data elements present at different stages of the calculation. This
way, instead of sequentially executing every iteration in isolation, the initiation interval can effectively be reduced to just one or a few clock cycles. Whereas the iteration latency previously determined the initiation interval directly, this initial latency has become far less important, as the resulting steady-state throughput is now defined by the pipelined initiation interval only. For maximum performance, it
is also recommended to convert external arrays into a streaming interface if possible.
This enforces all reads and writes to happen in a fully linear and predictable manner,
which better suits the pipelining paradigm.
All discussed categories including the Map, Stencil, Reduce, Gather and Search
parallel patterns ought to benefit from the insertion of a pipelining pragma in the in-
nermost pixel loop (i.e. the horizontal loop if the processing order is row-major). As
an example, its effect on a Map kernel called depth2vertex is discussed here, which
processes 320 × 240 = 76800 pixels at the resolution pyramid’s finest level. Syn-
thesizing the HLS top function while employing a streaming paradigm but without
applying any pipelining optimization yields Figure 3.13a. In this case, the initiation
interval of every loop equals its iteration latency, being five clock cycles. However,
applying the HLS pipeline directive to the for_x loop results in Figure 3.13b, where a
speed-up of a factor five is visible. The initiation interval is now exactly one, which
is the fastest possible: one result is produced and written to the output stream at
every clock cycle. Note that the iteration latency has increased (HLS likely had to
insert an extra register somewhere in the chain of hardware operations to maintain
conformance with dependency and timing constraints), but this change has no real
importance whatsoever compared to the drastic decrease in the initiation interval.
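The impact on throughput can be estimated with a simple model: a pipelined loop takes roughly iterations × II cycles, plus a one-time fill of (latency − II) cycles before the first result appears. The sketch below applies this to the figures above; the 100 MHz clock used in the note afterwards is an assumed, typical fabric frequency, not a value taken from the report.

```cpp
#include <cassert>

// Back-of-the-envelope cycle count for a pipelined loop: one iteration
// starts every II cycles, and the pipeline additionally needs
// (latency - II) cycles to fill before the first result appears.
long pipelined_cycles(long iterations, long ii, long latency) {
    return iterations * ii + (latency - ii);
}
```

For the 320 × 240 = 76 800 pixels above, II = 5 gives 384 000 cycles per invocation versus 76 805 cycles at II = 1: the factor-five speed-up seen in the report, or roughly 3.84 ms versus 0.77 ms at an assumed 100 MHz clock.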
To explain why the resource utilization has increased just barely despite this five-
fold speedup, the more detailed analysis perspective provided by HLS is depicted in
Figure 3.14. The analysis views for the non-pipelining variant look extremely simi-
lar and are therefore not shown. One can visualize pipelining as creating copies of
every blue rectangle and shifting them to the right by intervals of one control step,
although Vivado HLS presumably does not draw it in that manner to preserve clarity. A control step can be identified with one of the states of a finite state machine (FSM), although it corresponds to a single clock cycle in the majority of cases. The considered kernel multiplies every incoming data element with a 3x3 matrix, which corresponds to the numbered operations 27 to 43. If pipelining were not applied, the FPGA would still be able to exploit the intrinsic parallelism of a matrix-vector multiplication, but the many DSPs responsible for this calculation would be utilized inefficiently over time. The overhead of the loop, whose iterations are executed sequentially in time, would cause the DSPs to be activated only once or twice every five clock cycles. Applying the pipelining concept results in these same DSPs
now having a much higher duty cycle: 100 % to be precise. Unlike loop unrolling
or similar optimizations, the relative increase in resource usage is generally much
smaller than the performance gain factor obtained by applying the pipeline pragma.
To conclude, our FPGA acceleration of this routine consists of parallelization on one
hand and pipelining on the other, made possible respectively thanks to the kernel
computation’s inherent regularity and the overlapped usage of DSPs.
3.2.3 Efficient line buffering
For kernels that incorporate any degree of data reuse, more advanced techniques
than streaming and pipelining have to be utilized. Figure 3.15a depicts the typical
functionality of a kernel belonging to the Stencil category. Hardware convolutions
and stencil kernels alike have been studied extensively in literature [20], [21], [55]–
(A) HLS report (no pipelining).
(B) HLS report (with pipelining).
FIGURE 3.13: Effect of pipelining on the timing profile and resourceutilization.
[58], although no clear description exists of their implementation in HLS³. While not a full-fledged tutorial, this paragraph attempts to partially fill that gap, paying particular attention to handling feedback from the tool, as this is what constitutes a methodology.
A naive implementation of the bilateral_filter kernel is shown in Figure 3.16a. An
initiation interval of one clock cycle was not achieved, and the Vivado HLS console
displays blue messages as in Figure 3.16b to indicate why that is the case. The in-
spection and resolution of these warnings is an important aspect of the HLS design
process, and is part of the DSE philosophy. At the start of every iteration, the win-
dow is completely refilled with 3× 3 = 9 depth values read straight from the input
array. However, these 9 accesses cannot happen concurrently within one clock cy-
cle, which forces HLS to automatically increase the pipelined initiation interval to at
least 9 cycles. This is confirmed by the analysis view in Figure 3.16c, which visualizes
which operations occur in the synthesized design at every control step. This initial
configuration is sub-optimal not only due to the presence of recurring requests for
the same data elements, but also due to the irregularity of the memory accesses. The
required pixels are generally not available in memory precisely in the order that they
³The Window and LineBuffer data structures provided by Xilinx were found to lack the necessary features. Moreover, documentation on non-separable 2D filters is scarce: the only relevant sources would be [59], [60], although a different loop structure was used in this thesis.
(A) Performance analysis view (with pipelining).
(B) Resource analysis view (with pipelining).
FIGURE 3.14: Analysis of a pipelined Map kernel, showing the paral-lelized elementary operations constituting a matrix-vector multipli-cation. Note that the analysis view in Vivado HLS does not clearlyindicate overlapped computation, even though it is definitely presenthere: a read from and write to the streaming interface occurs at every
single clock cycle (or equivalently, control step).
(A) General operation of the Stencil kernel, shown with a 3x3 window as an ex-ample [56].
(B) Principle of the line buffer and shifting window operation,visualized onto the input image.
(C) Interaction between the line buffer and memory window, visualized as how they are actually struc-tured in memory (adapted from [55]).
FIGURE 3.15: Illustration of the Stencil parallel pattern and a corre-sponding buffering technique for its implementation on the FPGA.
are read and processed, so that the FPGA will have to 'gather' them and continually
make external DDR requests to the DRAM. Such requests inherently have a high la-
tency, reducing the real-world throughput even further compared to the HLS report.
A major improvement would be to access every data element exactly once, and
locally cache them as long as needed to allow for all stencil computations requiring
that particular pixel to be completed from data residing on the FPGA only. This
way, efficient streaming interfaces can be implemented for the incoming data. Fur-
thermore, we aim to store as little data as possible in order to optimize for BRAM
utilization as well. Figures 3.15b and 3.15c illustrate the principle of a window and
a line buffer, which are two important data structures that serve to fulfill this goal.
Denote the image height by H, its width by W and the window dimension by N. At every iteration, the whole memory window of size N × N is shifted left, after which the line buffer supplies N − 1 values for the window's rightmost column. In addition, a single value is read from the input stream and placed in both the window and the line buffer. The line buffer of size (N − 1) × W holds N − 1 full rows of pixels. It therefore has a width equal to that of the whole image, but experiences a vertical shift in just one column per iteration, so that the net result after one full row of iterations is an upward shift of the entire buffer.
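This update scheme can be sketched in plain C++ so that its behaviour is verifiable on a CPU; the image size, the int data type and the plain summing stencil are illustrative choices of ours, and border outputs are simply skipped here. In an HLS design, both arrays would additionally be partitioned, as discussed further on in this section.

```cpp
#include <cassert>

#define SN 3   // window dimension (odd)
#define SH 8   // image height
#define SW 8   // image width

typedef int data_t;

// CPU-verifiable sketch of the window/line-buffer scheme: every input pixel
// is read from the stream exactly once, cached in the (N-1) x W line buffer,
// and reused through the N x N shifting window. In HLS, both arrays would
// additionally be partitioned, e.g.:
//   #pragma HLS ARRAY_PARTITION variable=window   complete dim=0
//   #pragma HLS ARRAY_PARTITION variable=line_buf complete dim=1
void stencil_sum(const data_t in[SH][SW], data_t out[SH][SW]) {
    data_t line_buf[SN - 1][SW] = {};
    data_t window[SN][SN] = {};

    for (int y = 0; y < SH; y++) {
        for (int x = 0; x < SW; x++) {
            data_t pixel = in[y][x];          // the single external read

            // 1. Shift the whole window one position to the left.
            for (int i = 0; i < SN; i++)
                for (int j = 0; j < SN - 1; j++)
                    window[i][j] = window[i][j + 1];
            // 2. Refill the rightmost column: N-1 values from the line
            //    buffer plus the freshly read pixel at the bottom.
            for (int i = 0; i < SN - 1; i++)
                window[i][SN - 1] = line_buf[i][x];
            window[SN - 1][SN - 1] = pixel;
            // 3. Shift the line buffer upward in this column only.
            for (int i = 0; i < SN - 2; i++)
                line_buf[i][x] = line_buf[i + 1][x];
            line_buf[SN - 2][x] = pixel;

            // The window now holds rows y-N+1..y and columns x-N+1..x, so
            // the stencil centred at (y - N/2, x - N/2) is complete.
            if (y >= SN - 1 && x >= SN - 1) {
                data_t sum = 0;
                for (int i = 0; i < SN; i++)
                    for (int j = 0; j < SN; j++)
                        sum += window[i][j];
                out[y - SN / 2][x - SN / 2] = sum;
            }
        }
    }
}
```

Note that a full stencil result can be emitted for every single pixel read from the stream, which is precisely what makes an initiation interval of one attainable.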
The sizes of both data structures correspond to the amount that is minimally re-
quired while still allowing for every stencil computation to proceed efficiently start-
ing from this locally stored data only. A large square window of e.g. 15x15 would
require no more than 14 lines to be cached at any point in time, taking up space
on the order of 20 to 100 KB. For an image width of 640 pixels and a data size of 4
bytes per pixel, the Zynq-7020’s BRAM could theoretically store up to 162 lines (see
Section 3.2.4) together with a window of 163x163. Of course, a kernel actually using
data of such magnitude would likely exhaust the FPGA's resources through its computational weight long before it approaches these internal memory limitations.
In practice, none of the considered routines in KinectFusion (or most other use cases,
to our knowledge) come anywhere near this size. Transposing the image in order to
exploit lower image heights and thereby allow for even larger window sizes is there-
fore not necessary.
Figure 3.17 shows the HLS report, console and resource analysis views after im-
plementing the aforementioned buffering technique. Unfortunately, the design is
still flawed: this time due to internal memory congestion rather than PS-PL interfacing constraints. The more precise reason is that a single instance of the FPGA's dual-port Block RAM does not have enough ports to output all the
values that are needed at the desired rate. Furthermore, a shifting window requires
many simultaneous writes to the same array per iteration as well. The following
operations should be allowed to happen ideally within one clock cycle:
• N² reads from the memory window to be forwarded to the stencil calculation.

• N² writes to the memory window, N(N − 1) of which consist of left-shifted values, N − 1 of which are copied from the line buffer and the last one of which is read from the external stream.

• N − 1 reads from the line buffer to be forwarded to the window.

• N − 1 writes to the line buffer, N − 2 of which consist of up-shifted values and the last one of which equals the single new value received from the input stream.
(A) HLS report displaying a high initiation interval.
(B) Example of warning messages shown by Vivado HLS to inform the designer about possible bottle-necks.
(C) Performance view displaying concurrent accesses to the input image marked in red, with feedbackto the HLS source code.
FIGURE 3.16: Report and analysis of a naive implementation of bilat-eral_filter; neither line buffering nor array partitioning is applied.
(A) HLS report still displaying a high initiation interval.
(B) Warning messages shown by Vivado HLS, revealing internal memory constraints.
(C) Resource view displaying concurrent accesses to the memory window, from which a new bottle-neck arises.
FIGURE 3.17: Report and analysis of an improved implementation ofbilateral_filter which includes line buffer and memory window func-
tionality.
Note that the above list is not part of the stencil calculation itself, and exists purely
to manage the content of the buffer correctly. Similar to [25], we separate the com-
putation from the data movement to keep things simple. In case of a 3x3 window,
N² + N − 1 = 11 reads from and N² + N − 1 = 11 writes to the BRAM occur for
every iteration. Interdependencies between both data structures result in a slightly
higher initiation interval of 12 clock cycles.
In order to resolve the bottleneck, the back-end memory storage for the window
and line buffer needs to be partitioned into several BRAM instances or even registers. Vivado HLS provides an effective optimization directive to do exactly that
[1]. Multi-dimensional arrays can be distributed across multiple memory cells such
as individual registers or BRAM banks by applying the HLS ARRAY_PARTITION
FIGURE 3.18: Array partitioning strategy for optimizing Stencil com-putations. Differently colored elements need to be accessed indepen-dently and in parallel, which is possible only by distributing themacross different instances of internal storage components. (The mem-
ory window is fully partitioned in all dimensions.)
FIGURE 3.19: HLS report of the fully optimized bilateral_filter kernel.
pragma. The dimension(s) across which to partition and a cyclic factor can be spec-
ified as well, although only the former parameter is relevant in our discussion. Fig-
ure 3.18 depicts the minimally required degree of partitioning necessary to allow for
both data structures to perform without forming bottlenecks. Finally, Figure 3.19
shows the report whereby all discussed issues have been resolved. Resource utiliza-
tion has increased significantly, which is explained by the fact that the bilateral filter
performs relatively expensive operations on every pixel in its window. Previously,
this per-pixel computation could be spread over 12 clock cycles (actually forming
another pipeline within the pipelined loop), but as of now N² − 1 = 8 instances (the
center pixel can be optimized away in this particular case) have to start in parallel
at every clock cycle. The latency of this operation is 7 cycles, so by Little’s Law we
have that 56 such computations are in progress at any given point in time during the
kernel’s execution.
3.2.4 Random memory access
Some kernels have to read from, or write to, a large array that resides outside the
FPGA such as in DRAM. The presence of such irregular accesses often creates perfor-
mance bottlenecks on the FPGA. However, this fact is usually not visible in Vivado
HLS because it assumes that every request gets answered immediately within one
clock cycle. Simply put, I/O performance analysis falls outside the scope of Vivado
HLS as this tool is only concerned with everything that happens inside the FPGA.
The full complexity of system-level interactions between the PS, PL and DRAM can
only be assessed accurately by performing real-world executions, which will be ex-
plored in Chapter 5. Note that the term random is interchanged with unpredictable in
this context: the positions at which specific elements are needed can depend either on a limited set of parameters given to the subroutine, or on the content of another data stream. The default scenario is to leave all data in the DRAM, and let the
FPGA periodically make DDR requests whenever it needs one or more elements. In
this case, an AXI-master protocol should be applied so that the PS (representing the
outside memory) acts as a slave with the PL as its master. Due to the intrinsically high latency of performing random address lookups, this scenario is quite slow by design. Assuming that the required data does not fully fit in the FPGA's local storage,
two possible solutions were devised:
1. Block-by-block processing. This technique locally buffers smaller sections of
the whole body of data, and restarts the subroutine as much as needed in order
to adequately process all elements. These sections could either overlap or not;
the former case is preferred to limited the number of re-executions, but the
latter might prove more efficient when a cohesive neighbourhood of multiple
elements from the external array is often needed. The prerequisite for this
technique is a kernel that can easily be re-executed multiple times, and whose
input and output streams are of the same type. The input data that cannot yet
be handled correctly in the first execution would otherwise be lost forever. In
addition, a special flag has to be introduced for each element of the stream to indicate that some elements were not yet processed due to unavailability of the data required for the calculation of that specific element, and should be handled in a next iteration instead. The number of re-executions needed equals ⌈N/n⌉. Here, N is the full size of the randomly accessed data in bytes, and n denotes the size of the internal array containing one block of this data. A code example
of this method is found in integrate.
2. Intelligent buffering. In some cases, it is possible to execute the component
just once by buffering the data in a smart way so that most or all of it is already
locally available. Practically, this idea will result in code similar to how Stencil
patterns are handled, although we prefer the term scratchpad memory rather than line buffer here. At every moment in time, this scratchpad should
aim to cover areas of the randomly accessed data where it is most likely to be
needed, which often requires deleting and refilling portions of this internal
store throughout the whole execution of the program. Locality of reference
can readily be exploited using this method. If a requested position happens to
reside outside of the buffered region of interest (i.e. a cache miss occurs), then
a fall-back procedure can consist of either making a DDR request or doing
nothing and returning a (possibly incorrect) default value for that element.
The latter is only possible if the system can tolerate a small amount of invalid
data, while the former requires a statistical analysis to ensure that the cache
miss frequency remains low enough to warrant the additional overhead. A
code example of this approach is found in track.
The selection among these solutions depends on various factors, and neither of them can be applied in all situations. It may even be the case that the only feasible configuration is the default one, which does not copy data and uses DDR requests instead. Reasons include the conditions mentioned above not holding, or simply the lack of a significant improvement in the resulting performance.
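A toy sketch of the first (block-by-block) solution is given below; it is purely illustrative, all names are invented here, and the real use case operates on the 3D volume of integrate. Each pass copies one block of the external reference array into local storage and completes only the elements whose required data falls inside that block, relying on the per-element flag to defer the rest to a later pass.

```cpp
#include <cassert>

#define N_TOTAL 16   // full size of the externally stored reference array
#define N_BLOCK 4    // portion that fits in local (BRAM-like) storage

// Hypothetical stream element: each one carries the index of the reference
// value it needs, plus a 'done' flag so unfinished elements survive passes.
struct elem_t {
    int  ref_index;
    int  result;
    bool done;
};

// One pass: buffer reference[start .. start+N_BLOCK-1] locally and process
// every element whose required index lies inside that block.
void process_block(elem_t *stream, int n_elems,
                   const int *reference, int start) {
    int local[N_BLOCK];
    for (int i = 0; i < N_BLOCK; i++)       // burst copy into local storage
        local[i] = reference[start + i];
    for (int e = 0; e < n_elems; e++) {
        if (stream[e].done) continue;
        int idx = stream[e].ref_index;
        if (start <= idx && idx < start + N_BLOCK) {
            stream[e].result = 2 * local[idx - start];  // placeholder work
            stream[e].done = true;
        }
    }
}

// The kernel is re-executed ceil(N_TOTAL / N_BLOCK) times in total.
void process_all(elem_t *stream, int n_elems, const int *reference) {
    int passes = (N_TOTAL + N_BLOCK - 1) / N_BLOCK;
    for (int p = 0; p < passes; p++)
        process_block(stream, n_elems, reference, p * N_BLOCK);
}
```

With these sizes, four passes suffice regardless of the order in which the stream elements request their reference data, matching the ⌈N/n⌉ count derived above.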
In order to substantiate the discussed techniques and give an idea of when they
might be relevant, the maximum size of a single contiguous array residing in the
Zynq-7020’s Block RAM was experimentally determined using the code in Listing
3.5. Block RAM units come in sizes of 36 Kib containing two independently controlled 18 Kib RAMs [61], and HLS reports the BRAM usage in equivalents of 18
Kib. An array of 16-bit integers will cause the BRAM to be configured so that one
instance can effectively hold 1024 integers. The function was synthesized in HLS for
different values of BYTES, resulting in Figure 3.20. A peculiar behaviour is visible:
the tool always seems to round up BRAM utilization to the next power of two, an
observation also made by [3]. This means that an array of 512 KiB or 524 288 bytes
will result in 256 BRAM instances for memory storage, but any more than that will
cause at least 512 instances to be generated, which is far beyond the 280 BRAMs
offered by the Zynq-7020 PL. An important consequence is that the maximum size
of a single array is effectively limited by this rounding behaviour rather than by the
precise amount of BRAMs offered by the FPGA.
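A simplified model of this allocation behaviour (assuming, as above, that one 18 Kib instance effectively stores 1024 16-bit words, i.e. 2048 bytes) makes the cliff at 512 KiB easy to reproduce:

```cpp
#include <cstddef>

// Model of the observed BRAM allocation for a 16-bit integer array: the
// required number of 18 Kib instances is rounded up to the next power of two.
std::size_t bram18_instances(std::size_t bytes) {
    std::size_t needed = (bytes + 2047) / 2048;  // instances actually required
    std::size_t alloc = 1;
    while (alloc < needed) alloc <<= 1;          // observed rounding behaviour
    return alloc;
}
```

Under this model, a 524 288-byte array maps to exactly 256 instances, while a single extra byte already pushes the allocation to 512, well beyond the 280 BRAMs of the Zynq-7020 PL.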
In KinectFusion, the block processing technique is used during integration since
the main input and output arrays are 3D volumes there and thus of the same type.
See Section 4.1.8 for a further discussion. The intelligent buffering method will prove
to be very useful in the tracking phase. Because physical camera motion between
subsequent frames can be assumed to be small [5], a high degree of spatial locality
across multiple 2D arrays is present in the ICP part of the algorithm. This is discussed in more detail in Section 4.1.6.
#define BYTES 524288
#define COUNT (BYTES / 2)

int32 max_memory(const int16 source[COUNT]) {
#pragma HLS INTERFACE m_axi depth=64 port=source
    int16 local[COUNT];
#pragma HLS RESOURCE variable=local core=RAM_1P_BRAM

    memcpy(local, source, BYTES);

    // Do something with the data
    int32 res = 0;
for_i:
    for (int i = 0; i < COUNT; i++) {
#pragma HLS PIPELINE II=1
        res += local[i];
    }
    return res;
}
LISTING 3.5: Vivado HLS code to test the maximum size of a 16-bit integer array. Data is copied in burst mode from external memory, similar to how block-by-block processing is implemented in practice. Although the compiler places the local array into block RAM by default, the HLS RESOURCE directive [1] is still included for clarity.
FIGURE 3.20: Resulting BRAM instances in the HLS report for different memory sizes in Listing 3.5.
3.2.5 Data type selection
Vivado HLS supports the usage of arbitrary bit sizes for integers and fixed-point
numbers [1]. The philosophy behind this freedom is that every data type should
only be assigned as many bits as actually needed. This way, the designer can prevent the unnecessary allocation of FPGA resources for the calculation of bits that are
not useful enough to warrant the cost, or might even end up not being used at all.
On the other hand, floating point representations are only available in half- (16 bit),
single- (32 bit) and double-word (64 bit) sizes.
Several aspects contribute to the choice of data types for any given calculation.
A trade-off in several dimensions between resource cost and accuracy often has to
be made. When real numbers are required, the decision between floating point and
fixed-point representations turns out not to be so evident. In conjunction with this
aspect, the precise bit widths at the algorithm's disposal naturally have to be considered as well. The following paragraphs serve to motivate a substantiated decision
between the two alternatives in the context of 3D SLAM. First, two general use cases
will be considered:
1. An IP core that performs four basic arithmetic operations on every data element in a pipelined manner;

2. An IP core that calculates the square root of every data element in a pipelined manner.
The goal is then to test the effect of different data types on the FPGA's resource usage and latency. Note that I/O bandwidth considerations might also play a role; this aspect is the focus of Chapter 5.
Comparison between floating point and fixed-point

The notion that floating point operations are expensive to implement on reconfigurable hardware is backed by [62] as well as by practical measurements. The other option is to use fixed-point numbers, where the binary point is
placed at a fixed position within the bit representation. These provide the opportunity to perform arithmetic operations in a way that is much less costly for the FPGA's
resources, as will be illustrated below. On the other hand, the biggest drawback of
employing fixed-point representations is the lack of a high dynamic range. Floating
point numbers retain sufficient accuracy for both extremely small and large values,
while fixed-point numbers are only practical within selected orders of magnitude
due to the presence of a fixed step size and a relatively low maximum value. This
limitation notably led [3] to the decision of keeping the computation in the floating
point domain.
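The contrast in behaviour can be made concrete with a small software model of fixed-point quantization (illustrative only; in Vivado HLS this role is played by the ap_fixed types):

```cpp
#include <cmath>

// Quantize a real value to a fixed-point grid with F fractional bits.
// The absolute step size is 2^-F everywhere, unlike floating point,
// whose error scales with the magnitude of the value.
template <int F>
double fixed_quantize(double x) {
    const double scale = double(1 << F);
    return std::round(x * scale) / scale;
}
```

With F = 10 the step is 1/1024 ≈ 0.001, so a value of 0.0004 collapses to zero, while a 32-bit float would still represent it with about seven significant digits.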
typedef ap_fixed<32, 16> data_t;

for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
    data_t value_in = stream_in.read();
    data_t tmp = value_in + data_t(2.0);
    tmp = tmp * data_t(3.0);
    tmp = tmp - data_t(5.0);
    tmp = tmp / data_t(7.0);
    stream_out.write(tmp);
}
LISTING 3.6: Vivado HLS code for a simple pipelined fixed-point arithmetic calculation, belonging to the Map pattern.
Any application should ideally execute its whole pipeline either in the floating point domain or in the fixed-point domain, so as to avoid the cost of conversion. However, given that we are currently in the process of accelerating only subsets of the algorithm, conversions from floating point numbers to their equivalent fixed-point representation will have to be taken into account in the following measurements. This is because the non-accelerated parts of KinectFusion will run on a CPU,
where floating point representations are naturally the best choice. After all, the ARM
Cortex-A9 does not provide native support for fixed-point operations; neither does
the C++ language. Consider the HLS kernel in Listing 3.6, which consists of a series
of basic operations. Here, one addition, one subtraction, one multiplication and one
division are performed, each with a constant second operand. The loop is pipelined
to simulate a scenario whereby data is streamed and processed efficiently with an
initiation interval of one clock cycle.
Three cases are tested: first, the kernel is synthesized with all operations in the
32-bit floating point domain. Second, it is resynthesized in the 32-bit fixed-point domain by changing all data types; this is the case shown in the code snippet above.
Finally, the stream data types are changed to floating point again while keeping the
calculation within the fixed-point domain. As such, the overhead due to the added
float-to-fixed and fixed-to-float conversion steps is investigated. Table 3.2 summarizes the resulting HLS reports after synthesis on a Zynq xc7z020clg484-1. A second
experiment was done with the arithmetic operations replaced by a single square
root. As is commonly done, the Xilinx HLS library implementations of sqrt and
sqrtf are used to ensure a hardware-optimized design. Table 3.3 summarizes the
resulting HLS reports, and a representative code snippet is shown in Listing 3.7 for
completeness.
Several interesting observations can be made. First, it is evident that most (but
not all) timing and resource metrics improve significantly if the same operations
Calculation architecture               Iteration latency   DSP   FF     LUT
Floating point (32-bit)                33 cycles           7     1582   2259
Fixed-point (32-bit)                   7 cycles            9     605    621
Fixed-point (32-bit) with conversions  18 cycles           9     1659   2540

TABLE 3.2: Timing and resource usage for various implementations of a simple series of arithmetic calculations.
typedef ap_fixed<32, 16> data_t;

for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
    data_t value_in = stream_in.read();
    data_t value_out = hls::sqrt(value_in);
    stream_out.write(value_out);
}
LISTING 3.7: Vivado HLS code for a fixed-point square root calculation, belonging to the Map pattern.
are performed using fixed-point representations instead of floating point. This can be explained by the fact that arithmetic operations are complex to implement in the floating point domain, whereas in the fixed-point domain they are essentially just glorified integer operations. A drop by a factor of 3 to 4 in resource usage is therefore not uncommon.
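As a concrete software-level illustration of this point, a Q16.16 multiplication reduces to one integer multiply plus a shift, and an addition to a single integer add; the helper names below are ours, not part of any Vivado HLS API:

```cpp
#include <cstdint>

// Q16.16 fixed-point arithmetic written out as plain integer operations:
// this is what makes fixed-point cheap on FPGA fabric.
int32_t q16_from_double(double x) { return int32_t(x * 65536.0); }
double  q16_to_double(int32_t q)  { return q / 65536.0; }

int32_t q16_add(int32_t a, int32_t b) { return a + b; }  // one integer adder
int32_t q16_mul(int32_t a, int32_t b) {
    return int32_t((int64_t(a) * int64_t(b)) >> 16);     // multiply + shift
}
```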
Second, this fact cannot simply be generalized to all other types of operations, such as the square root. One reason is that the Zynq-7000 FPGA has native support for the floating point square root calculation in the form of a hardware core called FSqrt [1]. The analysis view in HLS confirms that a simple opcode is used in this case, while for fixed-point data types an external function is called with an implementation provided by Xilinx. Hence, care should be taken in selecting the optimal data type when a square root operation is present. Alternatively, optimized algorithms such as the LogiCORE IP core described in [63] could be used. This product is, however, not explored further in this thesis due to the lack of an evaluation license for academic purposes.
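To see why a fixed-point square root offers no such shortcut, consider the bit-serial shift-and-subtract scheme that such library implementations are typically built on (a generic textbook algorithm, not the Xilinx code): it produces one result bit per iteration and synthesizes to a long chain of adders and multiplexers.

```cpp
#include <cstdint>

// Bit-serial (shift-and-subtract) integer square root: one result bit per
// iteration, i.e. 16 iterations for a 32-bit input. Returns floor(sqrt(n)).
uint32_t isqrt(uint32_t n) {
    uint32_t res = 0;
    uint32_t bit = 1u << 30;          // highest power of four <= 2^30
    while (bit > n) bit >>= 2;
    while (bit != 0) {
        if (n >= res + bit) {
            n -= res + bit;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return res;
}
```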
Third, in terms of resources, the considered series of four arithmetic operations
happens to quantitatively coincide with a break-even point. The decision of whether
Calculation architecture               Iteration latency   DSP   FF     LUT
Floating point (32-bit)                15 cycles           0     559    779
Fixed-point (32-bit)                   22 cycles           0     2166   5810
Fixed-point (32-bit) with conversions  32 cycles           0     3058   7404

TABLE 3.3: Timing and resource usage for various implementations of a square root calculation.
to stay within the floating point domain, or whether to perform the operations using fixed-point representations in conjunction with a two-fold conversion, therefore depends on the complexity of the considered algorithm relative to this use case. In the
context of this dissertation, all kernels in KinectFusion except for mm2m_sample have a much higher computational complexity, so that the latter option is expected to be preferred in the majority of cases. In general, we have found that fixed-point operations are more efficient with respect to resources and, as a result, area and power usage as well. This finding is also supported by [64] and [65], among others. To our knowledge, the 32-bit square root operation seems to be the only exception to this rule on this specific hardware platform.⁴ It is also noted that the better accuracy provided by floating point representations is often not needed for many applications, as long as the dynamic range of the used fixed-point data type is sufficient.
Bit sizes and application context
Next to resource usage, another drawback of floating point operations is that their availability is restricted to 16-bit, 32-bit and 64-bit formats only. In contrast, hardware designers are free to allocate any number of bits to the integer and fractional parts of fixed-point arithmetic. While the comparisons in Tables 3.2 and 3.3 were made with 32-bit fixed-point calculations so as to match the bit sizes of the corresponding floating point numbers, the actual optimum strongly depends on the use case.
Depth value representation. The latest Microsoft Kinect sensor has a best-case
depth measurement accuracy error of less than 2 millimeters [66]. As shown in Figure 3.21, the accuracy worsens as we move either further away from the camera or
deviate from the center of the image. While the Kinect v1’s depth map acquisition
works by a fundamentally different principle5, similar conclusions have been drawn
regarding its precision in [44], [52]. In addition to the measurements experiencing
noise with a standard deviation on the order of a few millimeters, the offset (bias)
strongly increases as a function of the distance as well, which is plotted in Figure
3.22. The most important KinectFusion kernel in terms of resource utilization that
operates on these depth values is the bilateral filter. This operation is quite expensive, leading us to the choice of employing a fixed-point representation with a fractional part of 10 bits. As a result, the smallest increment equals 1000/2¹⁰ ≈ 1 mm, and no better accuracy is expected to be necessary according to the argumentation above. Since both Kinect cameras cannot see beyond 8 to 10 meters [52], the
⁴ Even though the exponential and logarithmic functions also have designated hardware cores, neither resource profile clearly dominates the other when comparing floating point versus fixed-point implementations.
⁵ The Kinect v1 projects an infrared light pattern and estimates the 3D geometry of a scene by analyzing the projected image. On the other hand, the Kinect v2 is a Time-of-Flight (ToF) sensor, which acquires the depth map by emitting modulated square waves and measuring the per-pixel phase difference of the reflection [52].
FIGURE 3.21: Kinect v2 accuracy error distribution [66].
FIGURE 3.22: Kinect v1 offset and precision [44].
integer part has to be at least 4 bits. Our final choice in this work for the representation of a depth value in meters is a 16-bit unsigned fixed-point number, assigning 10 bits to the fractional part and 6 bits to the integer part.
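This U6.10 format can be modelled as follows (a software illustration of the chosen representation; in HLS it presumably corresponds to an ap_ufixed<16, 6> type):

```cpp
#include <cstdint>

// U6.10 depth format: 6 integer bits give a range of [0, 64) metres,
// 10 fractional bits give a step of 1/1024 m, i.e. just under 1 mm.
uint16_t depth_encode(double meters) {
    return uint16_t(meters * 1024.0 + 0.5);  // round to the nearest step
}
double depth_decode(uint16_t raw) { return raw / 1024.0; }
```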
Other geometric value representation. 3D vertices, normal vectors, matrices
and Euclidean (squared) distances constitute the many other quantities that need to
be processed on the FPGA as well. In order to ensure sufficient accuracy for kernels
where fine-grained differences in distances (such as error accumulations) matter a
lot, we decided to assign a more generous 32-bit fixed-point representation to all of
these remaining values with the integer and fractional parts each taking up 16 bits.
This way, only a limited amount of bug-prone rescaling is needed when operating on small and/or large numbers, such as in the depth2vertex, vertex2normal, track, reduce and raycast kernels.
3.3 Summary
In this chapter, a detailed algorithmic description of a known 3D dense SLAM application was given. Afterwards, techniques were introduced to effectively accelerate
image processing methods with commonly occurring computation or data management patterns using high-level synthesis. This methodology involves both algorithmic and structural changes to the code as well as the straightforward insertion of pragmas. Some of these optimization directives thoroughly transform the design while abstracting away many of the underlying hardware details, especially with respect to managing cycle-by-cycle execution steps. Vivado HLS' reports and analysis views were shown to be very useful tools for gaining an understanding of what happens behind the scenes. The concepts of streaming interfaces, loop pipelining, data buffering, array partitioning and data type decisions were studied comprehensively. The presence of irregular memory accesses is a more complicated matter, however, as the strategy by which to tackle these kernels is much more context-dependent. The next chapter will evaluate the application of all studied techniques to KinectFusion step by step.
Chapter 4
Implementation of KinectFusion in HLS
Having reviewed a generally applicable methodology for the acceleration of image processing kernels with various parallel patterns, we now move to the concrete world of KinectFusion. In the following sections, HLS implementations of all kernels are covered separately, and we discuss the effect of the applied techniques as
well as some extra application-specific optimizations. Table 4.1 summarizes what
is achieved in Vivado HLS for each KinectFusion kernel. The parallel pattern was
inferred by inspecting the source code. The time spent is an estimate calculated by
multiplying the timing of a single run by the average number of re-executions per
frame. Next, the accuracy was computed by comparing every HLS kernel's output with a ground truth obtained via the original KinectFusion implementation, as described in Section 3.1.2. Unless noted otherwise, every method converts its input from floating point to fixed-point and performs the reverse operation on the output. This is done to ensure that all data communicated outside of the FPGA remains readable by the CPU, while also saving on resources for computations inside the FPGA;
see Section 3.2.5 for a grounded argumentation on this. Note that the raycast kernel was successfully implemented in C++, but its HLS optimization was not explored in this thesis due to a mixture of scope constraints and the Search category's inherent unsuitability. The higher degree of control as well as the inherently unpredictable number of iterations, irregular data positions and very low data reuse all make it a priori problematic to accelerate this component adequately on an FPGA.¹
¹ Even if we just consider data movement and ignore all other problems: the block-by-block processing method is not applicable because it fundamentally conflicts with how the Search pattern operates. Intelligent buffering faces the issue of deciding which parts to copy to the scratchpad and how to structure the memory layout efficiently. As even a partial volume is too large for most FPGAs, finer granularity is needed, which in turn requires advanced geometric calculations to determine the boundaries of these smaller blocks. In short, going down this path would quickly threaten to increase the overhead even beyond the regular case of not employing any buffering technique at all.
Kernel            Category       Input     Output      HLS timing   Error
mm2m_sample       Map            2D        2D          0.38 ms      0 %
bilateral_filter  Stencil        2D        2D          0.77 ms      0.03 %
half_sample       Stencil        2D        2D          0.25 ms      0.01 %
depth2vertex      Map            2D        2D          1.0 ms       0.09 %
vertex2normal     Stencil        2D        2D          1.0 ms       0.02 %
track             Gather & Map   2D        2D          23 ms        1.13 %
reduce            Reduce         2D        32 values   3.3 ms       0.47 %
integrate         Gather & Map   2D + 3D   3D          0-168 ms     0.32 %

TABLE 4.1: Category, I/O dimensions, estimated timing and average accuracy of every KinectFusion kernel when it would be executed on the FPGA. Bandwidth limitations and other external factors are not yet taken into account, since these fall outside the scope of Vivado HLS.
Kernel            BRAM [%]   DSP [%]   FF [%]   LUT [%]
mm2m_sample       0          0         5        13
bilateral_filter  0          21        10       19
half_sample       1          2         4        17
depth2vertex      0          4         3        11
vertex2normal     6          16        18       35
track             89         57        43       42
reduce            1          38        4        11
integrate         100        74        48       93

TABLE 4.2: Resource utilization estimated by HLS for every KinectFusion kernel's top function.
FIGURE 4.1: Effect of every optimization on the timing, resource and accuracy profile of mm2m_sample (Map). [Bar chart; per step: original code 24.6 ms, 6 % LUT; apply streaming 25.3 ms, 4 % LUT; apply pipelining 3.07 ms, 4 % LUT; unroll loop 0.38 ms, 13 % LUT. The mean error is 0 % throughout.]
4.1 Detailed results
4.1.1 mm2m_sample (Map)
The mm2m_sample kernel passes through a selected subset of all input pixels while dividing them by 1000. It is somewhat of an exception to the earlier categorization of parallel patterns, but for the purpose of our research, this method most closely belongs to the Map class. This pattern is handled by introducing streaming interfaces and pipelining the computation. It is also the only individually accelerated IP core where the author has chosen not to perform any conversion between floating point and fixed-point representations. Because the calculation itself is so light-weight compared to the overhead of an extra conversion step, the total resource usage and latency turn out to be lower when this conversion is left out. Figure 4.1 captures the
effect of every optimization separately on the HLS report, when the target device
xc7z020clg484-1 is selected with a clock period of 10 ns. The selected parameter configuration is described in Section 3.1.2, so the raw sensor input gets downsampled from 640 × 480 to 320 × 240. Every change is in accordance with the general
methodology that has been introduced over the previous sections, and is explained
in more detail below:
1. Original code. The function was copied straight from the reference implementation without any changes, except for the introduction of an interfacing protocol. AXI-master interface directives were added so that the IP core is able to communicate with the Zynq's DRAM via a master-slave link. The iteration latency is 32 cycles, and only a quarter of the input pixels are read from the array. The resulting execution time is therefore estimated to be 32 × 320 × 240 ×
10 ns = 24.6 ms, although this does not take into account the unpredictability
of DDR requests to outside memory.
2. Apply streaming. The interface was changed to an AXI-streaming protocol,
which enables the usage of a Zynq High-Performance port in conjunction with
the AXI Direct Memory Access (DMA) IP core. Furthermore, Xilinx’ HLS
stream data type in C/C++ enforces a single read, single write pattern over
the whole input and output maps. On the other hand, a drawback arises since
the kernel is forced to read every input pixel, regardless of whether that element will be processed or not. The iteration latency varies from 3 to 24 clock
cycles while iterating over all elements of the input array: 3 if it does not pass
through (which is 75 % of the time on average), and 24 if it does pass through
and thus gets rescaled by 1/1000. Following the report, the resulting execution
time is equal to (0.75× 3 + 0.25× 24)× 640× 480× 10 ns = 25.3 ms.
3. Apply pipelining. The loop body was pipelined with an initiation interval of one clock cycle. The iteration latency stays unchanged, but since a new division now starts (and ends) every four cycles, a speed-up by a factor of 8.25 has effectively been obtained. The increase in resource usage is negligible, which means that no extra hardware blocks had to be allocated in this transformation. This can be explained by the fact that the existing resources are now used much more efficiently, whereas previously the majority of them were only active every once in a while. The resulting execution time is 640 × 480 × 10 ns = 3.07 ms according to HLS.
4. Unroll loop. As there is a huge degree of spare room for extra computation, the loop body was manually unrolled by a factor of 8.²,³ One might expect that transforming the design so as to duplicate its kernel calculation 8-fold would increase the resource utilization by a factor of 8 as well. However, due to the nature of downsampling and the horizontal loop unrolling, the routine only has to perform 4 divisions per iteration rather than 8 within every second row of pixels. Essentially, 4 divisions happen at 50 % of the iterations on average, whereas previously just 1 division occurred at 25 % of the iterations. This is illustrated in Figure 4.2. The HLS report confirms that the number of FFs and LUTs allocated for the division has increased approximately 4-fold; in addition, they are now used more efficiently. Although there is still much room left in terms of resources, we decided not to explore further duplication of processing elements due to bandwidth constraints. Both streaming data types
² The decision for this value was made as a trade-off between (expected) bandwidth limitations and available hardware resources.
³ Duplication of processing elements has to be done manually, since HLS does not allow multiple reads from or writes to a stream construct within the same clock cycle. Therefore, the struct data type contained by the stream has to be updated as well.
FIGURE 4.2: I/O diagram of the mm2m_sample HLS kernel before (A) and after (B) duplicating its processing elements 8-fold, assuming no bandwidth bottlenecks.
have a width of 128 bit, holding either 8 unsigned short integers or 4 floating point values. The AXI_HP port supports a maximum throughput of 1,200 MB/s, although the discussed configuration already assumes 1,600 MB/s at its input and output streams at a clock period of 10 ns. No further real-world performance improvement can therefore be expected by increasing the unrolling factor. The theoretical timing of this kernel is 640/8 × 480 × 10 ns = 0.38 ms, although we expect something closer to 0.51 ms; a more accurate performance analysis follows in Chapter 5.
Note that the error remains extremely small, since the calculation stays in the floating point domain at all times. Due to its relative simplicity, this method was the only one where we found that adding conversions to fixed-point arithmetic increased resource usage instead of decreasing it.
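Functionally, the kernel can be summarized by the following behavioural model (plain C++, not the HLS source; the 2× subsampling factor and the division by 1000 match the description above):

```cpp
#include <vector>

// Behavioural model of mm2m_sample: keep one pixel out of every 2x2 block
// and convert it from millimetres (unsigned short) to metres (float).
std::vector<float> mm2m_sample(const std::vector<unsigned short>& in,
                               int width, int height) {
    std::vector<float> out((width / 2) * (height / 2));
    for (int y = 0; y < height; y += 2)
        for (int x = 0; x < width; x += 2)
            out[(y / 2) * (width / 2) + (x / 2)] = in[y * width + x] / 1000.0f;
    return out;
}
```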
FIGURE 4.3: Effect of every optimization on the timing, resource and accuracy profiles of bilateral_filter (Stencil). [Bar chart; per step: original code 1109 ms, 11 % LUT, 0.004 % error; apply pipelining 23.8 ms, 31 % LUT, 0.004 %; buffering & streaming 0.78 ms, 262 % LUT, 0.004 %; reduce window size 0.77 ms, 89 % LUT, 0.030 %; use fixed-point 0.77 ms, 29 % DSP, 0.035 %; weaken exp. func. 0.77 ms, 21 % DSP, 0.035 %.]
4.1.2 bilateral_filter (Stencil)
The bilateral_filter in KinectFusion is a non-linear, non-separable filter with a 5x5
window size whose operation is described in Section 3.1.1, and can be expressed
mathematically as follows [67]:
BF[I](i, j) = ( Σ_{k,l} I(k, l) · w(i, j, k, l) ) / ( Σ_{k,l} w(i, j, k, l) )    (4.1)

w(i, j, k, l) = Gσs( √((i − k)² + (j − l)²) ) · Gσr( |I(i, j) − I(k, l)| )    (4.2)

Gσ(x) = exp( −x² / (2σ²) )    (4.3)
Here, I(i, j) is the intensity of the source image at position (i, j) and BF[I] represents the destination image. The ranges for k and l depend on the window size. While the Xilinx OpenCV library already contains built-in bilateral filtering functionality [68], we decided to provide our own implementation, since it covers many interesting aspects of the optimization methodology. The reference code already includes optimizations not relevant to FPGA acceleration, such as storing the Gaussian coefficients Gσs (see Equation 4.3) beforehand. The changes applied in Figure 4.3 are explained below:
1. Original code. When using the reference implementation and an AXI-master interface, the resulting FPGA design performs extremely slowly. This is because every input pixel is redundantly accessed many times, on top of the fact that all iterations are executed serially.
2. Apply pipelining. A pipelining directive was inserted with a target initiation interval of 31 clock cycles, which is the smallest value for which HLS could successfully synthesize. Inner loops, including the filtering and shifting operations on the window, are automatically unrolled this way (although we also insert pragmas there for clarity). A bottleneck arises due to concurrent accesses to the externally located input depth array.
3. Apply buffering & streaming. In accordance with the methodology for handling Stencil-type kernels in Section 3.2.1, line buffer and window data structures were inserted. Every input element is now accessed exactly once instead of up to 25 times, allowing for a streaming implementation and a pipelined initiation interval of one clock cycle. This modification is the most drastic one from a technical standpoint. The new memory window as well as the helper array containing the Gaussian coefficients Gσs were partitioned into registers to prevent them from forming a bottleneck. The line buffer was partitioned into four BRAM banks. To produce Gσr, the exponential function in Equation 4.3 now has to be evaluated 24 times per clock cycle and has a latency of 7 cycles using HLS' default implementation, so that by Little's Law 168 such operations are in progress at any moment in time. Since this has caused the resource usage to skyrocket beyond the Zynq-7020's capacity, the strength of the algorithm will now have to be reduced in order to maintain the desired speed.
4. Reduce window size. The memory window was reduced from 5x5 to 3x3, so that the filtering operation effectively uses only 9 values per iteration instead of 25. The resource usage has dropped by approximately a factor of 3, since the number of exponential functions called per iteration has decreased from 24 to 8.
5. Use fixed-point. In accordance with Section 3.2.5, all computations were moved from floating point arithmetic into the fixed-point domain. A significant drop in area utilization is observed again, while the decrease in accuracy is negligible.
6. Weaken exponential function. The calculation of Gσr is still the source of high FF and LUT utilization, so the full-fledged exponential function was replaced by a piecewise linear approximation, shown in Figure 4.4. The three parameters (i.e. the coordinates of the first breaking point and the abscissa
FIGURE 4.4: Exponential function approximation for the bilateral filter, with the actual frequency (popularity) of all arguments translated to the thickness of the green layer. [Plot of exp(−x), its piecewise linear approximation and the argument popularity over 0 ≤ x ≤ 6.]
of the second) of this approximation function were optimized by minimizing the deviation between the mathematical and approximated outputs in a least squares sense. To this end, real sensor data was used to measure the distribution of input arguments and to perform a weighted, more reliable optimization accordingly. The total resource usage has again dropped by 30 %, with surprisingly little effect on accuracy. The inv_exp block has a latency of just two cycles, and uses 150 LUTs as opposed to 480 in the default HLS implementation.
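The shape of such an approximation can be sketched as follows; the breakpoint values below are arbitrary placeholders, not the least-squares-fitted parameters used in the actual design:

```cpp
// Two-segment piecewise linear approximation of exp(-x) on [0, inf).
// x1/y1 locate the first breaking point and x2 the zero crossing;
// the real values were optimized against measured input statistics.
double pwlin_exp_neg(double x) {
    const double x1 = 1.0, y1 = 0.35, x2 = 4.0;  // hypothetical parameters
    if (x <= 0.0) return 1.0;
    if (x <= x1)  return 1.0 + (y1 - 1.0) * (x / x1);  // first segment
    if (x <= x2)  return y1 * (x2 - x) / (x2 - x1);    // second segment
    return 0.0;                                        // clamp to zero
}
```

In hardware, each segment costs only a comparison, one multiplication and one addition, which is why the latency drops to two cycles.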
Accuracy versus strength trade-off
The last three steps of Figure 4.3 can also be approached from a different standpoint. Figure 4.5 explores the possible outcomes when we switch between two states of every relevant parameter. The first term of each point label indicates the window size, the second term indicates the data type domain, while the last term denotes which implementation of the exponential function is in use. 'Fixed' always includes conversion from and to floating point numbers, and 'HLS' means the built-in implementation of the exponential function, in contrast to the custom approximation. Note that even though the implementation with all optimizations enabled is Pareto-optimal, we observed that its error is relatively large. Copying the input straight to the output leads to an average error of 0.06 %, so profiles with a 3x3 window might not be of sufficient quality. This is an important trade-off, and the ability to combine multiple components together on the FPGA will be discussed extensively in Chapter 5. We decided to proceed with the '5x5 fixed pwlin (piecewise linear
FIGURE 4.5: Pareto diagram of the bilateral filter's average HLS resource usage (not including BRAM) and measured accuracy when all eight possible configurations of three separate optimizations are tested. One outlier with a large error is not shown.
exponential function)' profile despite its high DSP usage of 65 %. Lastly, little improvement in accuracy was observed when increasing the number of bits allocated to the depth value's data type, which is therefore kept at 16-bit fixed-point. Processing element duplication (i.e. loop unrolling) was not explored, precisely due to the already high area utilization of this kernel.
4.1.3 half_sample (Stencil)
The half_sample procedure subsamples its input by a factor of two in every dimension, while also preserving edges by averaging only values that are close to each other. Its computation pattern best fits the Stencil category, although every output pixel depends on four input pixels that are uniquely mapped to that output pixel only. This kernel is called twice per frame (once for 320 × 240 to 160 × 120 and again for 160 × 120 to 80 × 60), and the estimated timings in Figure 4.6 denote the sum of those two executions. The changes are explained in more detail as follows:
1. Original code. Synthesizing the reference implementation with AXI-master
and AXI4-Lite slave interfaces results in an initiation interval of 50 clock cycles.
2. Apply pipelining. HLS reports a pipelined initiation interval of 4 cycles; this
is misleading, however, since the actual performance depends on whether
the DRAM will be able to keep up with the continuous DDR requests. Even
though every input pixel is accessed exactly once, the pattern is not linear as a
function of the address: instead, it quickly jumps back and forth between even
and odd rows of the source image.
66 Chapter 4. Implementation of KinectFusion in HLS
[Bar chart: estimated timing, maximum resource usage and mean error for each optimization step, from the original code through loop unrolling.]

FIGURE 4.6: Effect of every optimization on the timing, resource and accuracy profiles of half_sample (Stencil).
3. Apply buffering & streaming. Line buffer and memory window functionality
was implemented in order to reliably achieve an initiation interval of one clock
cycle. Due to the nature of streaming interfaces, the estimated timing is now
much more reliable.
4. Use fixed-point. This step eliminated all DSP usage, which has the benefit of
leaving more space for other kernels, particularly the bilateral filter.
5. Manual division. A striking optimization opportunity that seems to have gone
unnoticed by the compiler was found in Figure 4.7, part of the filtering
subroutine. This operation stems from the fact that depth pixels are only aver-
aged when their absolute value is within 30 cm of the window's upper-left
value. To convert the sum of a varying number of values into an average,
a division by 1, 2, 3 or 4 is necessary. HLS implemented this as a full-fledged
unsigned division with a latency of 31 clock cycles, but it is hard to believe that
this degree of complexity is needed. After all, every case except one could con-
sist of a simple bit shift, and even a division by 3 ought to be much faster than
31 cycles. Indeed, by writing a switch-case construct and performing the di-
vision manually, the area utilization of the window calculation itself was more
than halved and the total iteration latency was reduced from 35 to just 5 cycles.
6. Unroll loop. Similar to mm2m_sample, this kernel was manually unrolled with
a factor of 4. The resource utilization also doubled instead of quadrupling; the
reason is visualized in Figure 4.8. Again, the true timing is expected to exceed
0.25 ms, as the Zynq's HP port cannot transfer at rates beyond 1,200 MB/s.

FIGURE 4.7: HLS performance analysis view of an unnecessarily complex division that went unnoticed by the HLS compiler.
4.1.4 depth2vertex (Map)
The depth2vertex method converts every depth pixel into a 3D point by multiplying
the inverse camera matrix with (x, y, D(x, y)). Here, the camera matrix describes
the projection from the real-world space into the image plane, and D represents the
incoming depth map. The applied optimizations are described in Figure 4.9. Since
this routine is executed three times per frame (once for every pyramid level), the
total timing is the sum of these executions. Loop unrolling was not applied since
the output stream is 128 bit wide in the final variant: the required bandwidth for
optimum performance would already be 1,600 MB/s. Note that the actual width of
the point struct is 96 bit for both the floating point and fixed-point representations.
However, the AXI DMA and Interconnect IP cores do not support memory-mapped
or stream data sizes that are not a power of two [23], leading us to insert padding for
those structs.4 Relative to the previous kernels, no other novelties arise in the HLS
optimization of this kernel.
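A minimal sketch of this back-projection, assuming a standard pinhole model; the focal lengths (fx, fy), principal point (cx, cy) and all names are hypothetical, not taken from the thesis code. Applying the inverse camera matrix to (x, y, 1) and scaling by the depth reduces to two multiply-divides and a copy per pixel.

```cpp
#include <cmath>

// Illustrative point struct; the real kernel pads this to 128 bit for AXI.
struct Point3 { float x, y, z; };

// K^-1 applied to homogeneous pixel coordinates, scaled by the depth D(x, y).
Point3 depth_to_vertex(int x, int y, float depth,
                       float fx, float fy, float cx, float cy) {
    Point3 p;
    p.x = depth * ((float)x - cx) / fx;
    p.y = depth * ((float)y - cy) / fy;
    p.z = depth;
    return p;
}
```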
4.1.5 vertex2normal (Stencil)
The vertex2normal kernel uses four vertices surrounding the regular loop indices to
compute a normal vector at every position. The last part of every iteration is a vector
normalization, which in turn requires the calculation of a square root. Figure 4.10
summarizes the changes that were applied: clearly, moving the computation into
the fixed-point domain is beneficial neither in terms of resources nor accuracy.
4 Technically, data width conversions can be enabled within the AXI IP cores as well [23]. Whether to control padding in HLS or in the Vivado block design is simply a choice to be made by the designer.
(A) Before unrolling.
(B) After unrolling.
FIGURE 4.8: I/O diagram of the half_sample HLS kernel before and after duplicating its processing elements 4-fold, assuming no bandwidth bottlenecks.
[Bar chart: estimated timing, maximum resource usage and mean error for each optimization step.]

FIGURE 4.9: Effect of every optimization on the timing, resource and accuracy profile of depth2vertex (Map).
[Bar chart: estimated timing, maximum resource usage and mean error for each optimization step.]

FIGURE 4.10: Effect of every optimization on the timing, resource and accuracy profile of vertex2normal (Stencil). Contrary to most other cases, the conversion from floating point to fixed-point has a negative effect here.
This perhaps surprising result was foreshadowed by Section 3.2.5, where it was de-
termined that the fixed-point square root calculation takes up more resources than
its floating point counterpart. In addition, more conversions between both domains
are required per iteration at the FPGA’s input and output, since 3D points consist
of three real-valued quantities per element. Our attempts to create a lighter yet suf-
ficiently accurate implementation of vector normalization were unsuccessful: they
did result in fewer LUTs, but the DSP utilization increased instead.
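A minimal sketch of such a stencil computation, assuming the normal is taken as the normalized cross product of differences between the four surrounding vertices; struct and function names are illustrative. The square root at the end is the operation that, as discussed above, is kept in floating point.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3 sub(Vec3 a, Vec3 b)   { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y,
            a.z * b.x - a.x * b.z,
            a.x * b.y - a.y * b.x};
}

// Normal from the four neighbouring vertices of the current position.
Vec3 normal_at(Vec3 left, Vec3 right, Vec3 up, Vec3 down) {
    Vec3 n = cross(sub(right, left), sub(down, up));
    float len = std::sqrt(n.x * n.x + n.y * n.y + n.z * n.z);  // costly sqrt
    if (len == 0.0f) return {0.0f, 0.0f, 0.0f};                // degenerate
    return {n.x / len, n.y / len, n.z / len};
}
```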
4.1.6 track (Gather & Map)
The track method is the first kernel where unpredictable memory access occurs.
Since there are still two input maps and one output map that is accessed at regu-
lar positions next to the randomly accessed reference array, its category is a mixture
of Map and Gather. While we were unable to finalize the debugging and optimization
of this method due to time constraints, the applied changes are nevertheless listed
below:
1. Original code. Due to the large number of control structures present in this
kernel, the iteration latency varies between 28 and 171 clock cycles. The control
flow of every iteration strongly depends on the content of the four vertex and
normal maps.
2. Pipelining & streaming. After converting the interfaces into streams (except
for the two reference maps) and pipelining the horizontal loop, synthesizing
this function proved difficult for HLS due to timing constraints. Since the
indices requested from both reference arrays residing in DRAM are data-
dependent and therefore completely unpredictable for the HLS compiler, this
created a very long critical path in the resulting design. The target clock period
had to be increased to a whopping 80 ns for the DDR requests to be able to
complete in time before the next iteration could issue its pair of requests.
The lowest feasible pipelined initiation interval equals 10 clock cycles despite
this lengthy clock period.

[Bar chart: estimated timing, maximum resource usage and mean error for each optimization step.]

FIGURE 4.11: Effect of every optimization on the timing, resource and accuracy profile of track (Gather & Map).
3. Use fixed-point. The usage of fixed-point arithmetic allowed us to decrease
the clock period to 40 ns. Since there is much more input and output data than
before, conversions between floating point and fixed-point were left out in this
configuration however; they can be assumed to take place in preceding blocks,
albeit at a slower rate to keep the resource utilization at bay.
4. Intelligent buffering. The decisive transformation of this kernel into a reason-
ably fast variant is one that incorporates key insights into what the algorithm is
actually doing. The track routine constitutes part of the Iterative Closest Point’s
projective variant, but since the sensor’s movement from one frame to the next
is often limited [5], the idea is that distances between corresponding points
over subsequent frames should also be small. In other words, the 2D positions
of both pixels ought to be close to each other. This is confirmed by Figure 4.12,
a heatmap which depicts how frequently both positions relate to each other
in various configurations. It appears that the indices accessed in the reference
arrays are (statistically speaking) almost always in the neighbourhood of the
indices linked to the linear loop over the actual input maps, assuming both
sources have the same size.

FIGURE 4.12: Heatmap of the accessed pixel positions within the reference maps relative to the corresponding regular loop over the input maps for the first level of track. Yellow means high frequency, purple means the opposite. The underlying data was extracted from five frames selected over a video fragment captured at 30 FPS, and shows that horizontal movement of up to 750 pixels per second occurred at some point.

A line buffer-like structure was therefore inserted,
containing several rows of both reference maps. In-between every row of itera-
tions, a burst copy is made to update the scratchpad. Any number of rows can
in principle be chosen, depending on the desired maximal BRAM utilization.
For example, 41 rows leads to 89 % BRAM usage and means that ∆x is un-
bounded although |∆y| ≤ 20 must hold for a cache hit to occur. If a requested
pixel happens to fall outside of the currently buffered data, then a default code
equivalent to an invalid measurement is returned. Note that the increase in
error is unexpected behaviour (a bug seems to be present that we could not
resolve); the optimization of this kernel is nevertheless presented as a valid
proof of concept for the scratchpad technique, which achieves a theoretical
speed-up factor of 5 to 6.
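The scratchpad lookup can be sketched as follows. The row count matches the 41-row configuration described above, but the sentinel encoding, buffer layout and all names are assumptions for illustration; in the real kernel the buffer is refreshed by burst copies from DRAM between rows of iterations.

```cpp
// Row-band scratchpad: ROWS rows of a reference map are buffered around the
// current row base_y; any request outside the band is a cache miss.
constexpr int   ROWS    = 41;      // 89 % BRAM in the thesis configuration
constexpr int   HALF    = ROWS / 2; // so |dy| <= 20 for a cache hit
constexpr int   WIDTH   = 320;
constexpr float INVALID = -2.0f;   // assumed invalid-measurement sentinel

static float ref_buf[ROWS][WIDTH]; // refreshed by burst copies per row

float ref_lookup(int req_y, int req_x, int base_y) {
    int dy = req_y - base_y;
    if (dy < -HALF || dy > HALF) return INVALID;            // outside the band
    return ref_buf[((req_y % ROWS) + ROWS) % ROWS][req_x];  // circular row slot
}
```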
4.1.7 reduce (Reduce)
The reduce kernel sums various pairwise products of elements from track's output
stream, which are later fed into update_pose. No real novelties arise in its optimi-
zation, described in Figure 4.13, except that the struct holding all output values had
to be partitioned into multiple registers in order to achieve an initiation interval in
HLS of exactly one clock cycle. Moreover, the computations have to happen in the
fixed-point domain, as cumulative floating point additions restrict the II to at least 4
cycles as well. However, since the input stream is 256 bit wide corresponding to a
computation speed of 3,200 MB/s, the practical throughput is expected to be lower
by a factor of around three due to AXI_HP’s bandwidth constraints.
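The register-per-sum pattern can be sketched as follows; K, the data layout and the int64_t accumulators (standing in for the fixed-point type) are assumptions. Partitioning the accumulator array is what lets HLS keep one independent register per sum and reach an initiation interval of one.

```cpp
#include <cstdint>

constexpr int K = 8;  // assumed number of output sums

// Each of the K sums accumulates in its own register; the HLS pragmas shown
// are how the partitioning and II=1 pipelining would be requested in Vivado HLS.
void reduce_sketch(const int32_t in[][K], int n, int64_t out[K]) {
    int64_t acc[K] = {0};
#pragma HLS ARRAY_PARTITION variable=acc complete
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        for (int k = 0; k < K; ++k)   // fully unrolled across registers
            acc[k] += in[i][k];
    }
    for (int k = 0; k < K; ++k) out[k] = acc[k];
}
```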
[Bar chart: estimated timing, maximum resource usage and mean error for each optimization step.]

FIGURE 4.13: Effect of every optimization on the timing, resource and accuracy profile of reduce (Reduce).
4.1.8 integrate (Gather & Map)
The integrate method performs volumetric integration, and loops over every element
in the 3D volume to incorporate the most recently captured sensor measurement.
The optimizations, summarized in Figure 4.14, incorporate two important changes.
First, the full 320× 240 depth array of 300 KiB is copied to local storage before start-
ing the loop, similar to block-by-block processing described in Section 3.2.4. The
HLS compiler allocates a number of instances in order to fit up to 512 KiB instead of
300 KiB, which explains the high BRAM utilization. Only one block is used in this
use case, but the kernel supports upscaling to multiple blocks as well. For a depth
map of 640× 480, three sections and thus re-executions of the full routine would be
required. Due to the huge volume size however, this would bring up the maximal
runtime to a rather impractical 503 ms. Second, we observe that not all iterations
can possibly contribute to a meaningful update of the volume. The camera is
always looking at just a part of the global volume, around which a minimally
encompassing cube can be mathematically determined. This principle is shown in
Figure 4.15. Since the total number of iterations now depends on the size of this
subvolume, the timing now also depends on the precise sensor location and can be
anywhere between 0 and 168 ms. At the start of the KinectFusion algorithm, the
camera is by definition positioned at the center of the volume spanning (8 m)³,
oriented perpendicularly towards one face of the volume. Knowing the Kinect v1
depth image's field of view of 58.5 × 46.6 degrees [69], the initial encompassing
cube volume is calculated to be approximately 9.7 % of the total volume. As such,
this fraction is also used to produce a rough estimate of the average timing,
resulting in 16.2 ms.
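The loop-bound restriction can be sketched as follows; how the bounds are derived from the camera pose and field of view is use case-specific, so they are assumed precomputed here, and all names and the volume size are illustrative.

```cpp
#include <algorithm>

constexpr int VOX = 256;  // assumed voxels per axis

// Precomputed bounding block of the view frustum, in voxel coordinates.
struct Block { int lo[3], hi[3]; };

// Clamp the block to the volume and visit only potentially visible voxels.
long count_visited(Block b) {
    for (int a = 0; a < 3; ++a) {
        b.lo[a] = std::max(b.lo[a], 0);
        b.hi[a] = std::min(b.hi[a], VOX);
    }
    long visited = 0;
    for (int z = b.lo[2]; z < b.hi[2]; ++z)
        for (int y = b.lo[1]; y < b.hi[1]; ++y)
            for (int x = b.lo[0]; x < b.hi[0]; ++x)
                ++visited;  // the TSDF update would go here
    return visited;
}
```

The iteration count, and hence the timing, now scales with the subvolume rather than the full (8 m)³ volume.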
[Bar chart: estimated timing, maximum resource usage and mean error for each optimization step.]

FIGURE 4.14: Effect of every optimization on the timing, resource and accuracy profile of integrate (Gather & Map).
FIGURE 4.15: Two-dimensional illustration of a frustum-encompassing block, to which loop boundaries can safely be restricted. The green coloured blocks represent volumetric elements that are visible from the sensor's current position, meaning that all yellow elements remain unchanged during integration.
                    Unchanged code   Fully optimized   Methodology   Additional   Total
Kernel              HLS timing       HLS timing        speed-up      speed-up     speed-up
mm2m_sample         24.6 ms          0.38 ms           ×8.01         ×8.08        ×64.7
bilateral_filter    1109 ms          0.77 ms           ×1422         ×1.01        ×1440
half_sample         12.0 ms          0.25 ms           ×12.4         ×3.88        ×48.0
depth2vertex        32.3 ms          1.01 ms           ×32.0         ×1           ×32.0
vertex2normal       92.7 ms          1.01 ms           ×91.8         ×1           ×91.8
track               566 ms           23.0 ms           ×4.29         ×5.74        ×24.6
reduce              95.9 ms          3.31 ms           ×29.0         ×1           ×29.0
integrate           23.8 s           16.2 ms           ×142          ×10.4        ×1469
Median                                                 ×30.5         ×2.45        ×56.4

TABLE 4.3: Impact of the optimizations arising from adoption of the methodology versus use case-specific knowledge on the estimated performance of KinectFusion's kernels in HLS.
4.2 Discussion
This chapter describes the high-level synthesis design phase of all FPGA-eligible
KinectFusion kernels except one. It is clear that a significant portion of this work ex-
tends beyond merely applying the methods presented in Section 3.2, although they
did provide a strong head start in both qualitative and quantitative terms. A non-
trivial understanding of the algorithm is nonetheless required to apply the proce-
dures correctly, and to take advantage of opportunities for additional time-oriented
or resource-oriented optimizations.
4.2.1 Evaluation of the methodology
A straightforward application of all techniques discussed in Chapter 3 yields a me-
dian speed-up of a factor 30.5 for the individual components of KinectFusion, while
extra changes created an additional median performance gain of 2.45. Table 4.3 goes
into more detail, and lists the effect of these two types of design transformations
for each kernel. It is not always easy to separate changes originating from the
methodology from 'additional' optimizations. Loop unrolling, for example, is
standard practice in the FPGA community, but was not described in Chapter 3.
Moreover, the biggest reason why duplication of processing elements proved useful
is that the two respective routines (mm2m_sample and half_sample) are downsam-
plers. On the other hand, the intelligent buffering method in track does, strictly
speaking, belong to the methodology but was instead included in the additional col-
umn because it still requires a significant degree of insight into the SLAM application
and its context. Not listed in Table 4.3 is the impact on resources, even though sev-
eral important extra optimizations did drastically decrease the total area utilization
of some components.
While different in philosophy from the range of source-to-source compilers found
in literature [58], [70], [71], we conclude that the more manual HLS methodology that
we introduced based on a selection of SLAM kernels is a positive result, and might
even be preferable to using the aforementioned tools. It is hard to imagine that auto-
matic code translation will be able to reliably deal with superfluous hardware usage
such as the division in half_sample. While this instance is not particularly detrimen-
tal in any respect, the implication that similar situations could get overlooked does
suggest that the degree of automation is generally correlated with a lesser quality of
the final design. Working at a lower level enables more fine-grained control of all as-
pects of the design process, which in turn opens gateways towards better efficiency
and performance.
Furthermore, the reason why no fully automatic FPGA compiler exists becomes
even clearer once we move outside the domain of image processing. The design
space grows exponentially as more possible paths can be taken, and Vivado HLS,
for example, has to ask for hints in the form of pragmas
in order to incorporate indispensable human intelligence into its compilation pro-
cess. Otherwise, the dimensionality of DSE would become unfathomably high; even
more so when it comes to the combined implementation of multiple kernels dis-
cussed in Chapter 5. As a result, the workflow of HLS aims to provide a balanced
mix of high-level and low-level details. In addition, opportunities for optimizing
individual hardware operations can still be exploited, while the repetitive specifics
of established paradigms such as pipelining and I/O interfacing are taken care of
mostly automatically.
Chapter 5

System-level acceleration of multiple kernels
Henceforth, we focus on the first five kernels of KinectFusion, of most of which
multiple instances (also called levels, modes or variants) are executed for every in-
put depth frame. The other parts of the algorithm would certainly lend themselves
to deeper investigations as well, but were deemed out-of-scope for this thesis due to
technical difficulties. The dataflow principles presented in this chapter are largely
applicable to the remaining half of KinectFusion as well. In addition, we refer to [37]
for an FPGA implementation and system-level integration of track and reduce. As
shown in Figure 3.2, all five kernels mm2m_sample through vertex2normal are inde-
pendent of the global reconstruction volume, as they serve to transform the sensor
measurement into various formats that will be used in later stages of KinectFusion.
The term system-level acceleration refers to the question of how to accelerate not
just one kernel, but a multitude of algorithmic components within the heterogeneous
system. Recall that an SoC essentially offers a CPU and an FPGA with a common
DRAM. The quality of the design implemented onto the FPGA as well as the regula-
tory software program on the CPU will determine the extent to which the coopera-
tion between PS and PL occurs harmoniously. As mentioned in the introduction, this
sophisticated duality of having to manage both hardware and software leaves a
high degree of freedom to the system designer. We propose three different
architectures:
1. Independent coexistence of kernels. Every accelerator, each corresponding to
one kernel, exists as an independent block in hardware and fully manages its
own connection to DDR via the PS. No direct inter-component communication
is possible.
2. Block-level dataflow. Subsequent kernels on the same datapath are also di-
rectly connected in hardware. The many complications arising from the prac-
tical implementation of this idea are resolved during the Vivado block design
phase.
3. HLS-level dataflow. Similar to the above, but the issues are resolved through
C++ code within the HLS top function instead. All kernels thereby reside in-
side just one IP core.
After reviewing the specifics of KinectFusion’s challenging multi-level dataflow, the
remainder of this chapter will implement and compare the three aforementioned
configurations. Note that both the problem statement and its solutions are also care-
fully generalized, which means that throughout this text we will occasionally jump
back and forth between abstract principles and their concrete application in the con-
sidered use case.
5.1 Dataflow of KinectFusion
The first five kernels of KinectFusion are shown in Figure 5.1, putting emphasis on
which data they process, how the various blocks interact and how they depend on
each other. A ’pyramid’ with three distinct levels of data processing can be distin-
guished, all of which are needed to perform the multi-level tracking phase after-
wards (see Figure 3.2). From a performance analysis perspective, every block can be
summarized as in Table 5.1 and as follows:
1. mm2m_sample: Accepts a raw sensor measurement array of unsigned short
integers as input and downsamples by a factor of two in both dimensions to pro-
duce an array of single-precision floating point depth values as output. The
data rate between input and output is halved on average, although this ratio
varies between 0 and 1 at runtime.
2. bilateral_filter: Filters an array of single-precision depth values, maintaining
the same resolution and datatype at its output. The data rate stays equal be-
tween input and output.
3. half_sample: Downsamples an array of depth values by a factor of two in both
dimensions. The data rate between input and output is decreased by a factor
of 4 on average, but varies between 0 and 1/2 at runtime.
4. depth2vertex: Converts an array of single-precision depth values to 128-
bit point structs, maintaining the same resolution at its output. The data rate
is increased by a factor of 4 between input and output.
5. vertex2normal: Filters an array of point structs, maintaining the same resolu-
tion and datatype at its output. The data rate stays equal between input and
output.
All kernels under consideration have fast 2D streaming implementations thanks
to the HLS optimizations that were investigated in previous chapters. The ques-
tion arises of how to most appropriately handle the execution of multiple kernels if
we want to off-load them all onto the FPGA.

FIGURE 5.1: Dataflow diagram of the first five kernels of KinectFusion.

Instance          Input size   Input width   Output size   Output width   Data rate factor
mm2m_sample       640 × 480    16-bit        320 × 240     32-bit         1/2
bilateral_filter  320 × 240    32-bit        320 × 240     32-bit         1
half_sample1      320 × 240    32-bit        160 × 120     32-bit         1/4
half_sample2      160 × 120    32-bit        80 × 60       32-bit         1/4
depth2vertex0     320 × 240    32-bit        320 × 240     128-bit        4
depth2vertex1     160 × 120    32-bit        160 × 120     128-bit        4
depth2vertex2     80 × 60      32-bit        80 × 60       128-bit        4
vertex2normal0    320 × 240    128-bit       320 × 240     128-bit        1
vertex2normal1    160 × 120    128-bit       160 × 120     128-bit        1
vertex2normal2    80 × 60      128-bit       80 × 60       128-bit        1

TABLE 5.1: I/O characteristics of all instances of KinectFusion's first five kernels.

More specifically, the current problem
statement can be expressed as follows: given the task of having to obtain all seven
required output maps shown in Figure 5.1, how do we best design the architecture
of our hardware and software in order to calculate the desired data as efficiently as
possible? Here, efficiency means maximizing speed while minimizing re-
source utilization. We assume that none of the 2D arrays can possibly fit on the PL at
once, so that the streaming paradigm somehow has to be maintained throughout all
computations. On the other hand, most streaming implementations assume a single-producer, single-consumer pattern. This means that it is impossible to connect a single
output to multiple inputs of other blocks, unless those blocks are forced to always
operate in a fully synchronized manner and thus to run in parallel at the same rate.
The last constraint is to maximize the degree of resource sharing among all blocks,
so that different instances of the same kernel across multiple levels are preferably
not implemented as completely separate layouts on the FPGA.
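The synchronized-fork workaround mentioned above can be modelled as follows; std::queue stands in for hls::stream so the sketch runs on a host, and all names are illustrative rather than taken from the thesis design.

```cpp
#include <queue>

// One producer stream is forked into two identical streams. Because every
// element is written to both outputs, the two downstream consumers are forced
// to drain at the same rate, i.e. to run fully synchronized.
void duplicate_stream(std::queue<float>& in,
                      std::queue<float>& a, std::queue<float>& b) {
    while (!in.empty()) {
        float v = in.front();
        in.pop();
        a.push(v);
        b.push(v);
    }
}
```

In Vivado HLS this stage would use hls::stream arguments and a pipelined loop with II=1.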
5.1.1 Generalized problem statement
The first step is to recognize that handling the complex dataflow of KinectFusion
concerns two different but related challenges. Figure 5.2a depicts the configuration
of a general string of kernels when no multi-level functionality is present and the
single-producer, single-consumer pattern holds everywhere. These blocks can quite
easily be coarse-grain pipelined as explained in Section 5.4. Building onto this base
case, Figure 5.2b then adds the requirement that intermediate outputs from kernels
residing in the middle also have to be stored for later usage. If the output of A is
not stored somewhere, then it is lost forever because B transforms the stream into
something else. As mentioned before, temporarily storing the data in local memory
for later retrieval is not possible either, so it has to be written directly to the DRAM.
How to achieve this represents the first aspect of our generalized problem statement.
Next, Figure 5.2c exemplifies the concept of similar instances of kernels handling
different streams. Here, a high similarity means that a large fraction of hardware area
can be shared among the different variants Ai. In practice, these variants might sim-
ply consist of HLS top functions that are being called with different parameters. The
fact that the back-end hardware implementation of multiple instances can be merged
to some degree will prove to be beneficial in minimizing resource utilization, espe-
cially for a low-end device such as the Zynq SoC. However, the maximum degree
of high-level parallelism is also restricted this way, so that a trade-off will have to
be made. How to efficiently reconcile this with intermediate output accumulation
represents the second part of our problem statement.
(A) Simple configuration: all kernels are connected together in a linear fashion.

(B) Configuration with intermediate output aggregation: the output of some kernels is needed in later stages as well, rather than by its direct successor only.

(C) Configuration with multi-level execution: several variants of each kernel handle data streams in principle independently from each other.

FIGURE 5.2: Illustration of two generalized dataflow challenges.
5.2 System architecture
Figure 5.3 depicts the important aspects of our initial system architecture. Here, the
processor is connected to all IP cores present in the PL via an AXI-Lite interface. This
is a low-throughput memory-mapped protocol that enables simple communication
of control and status registers [23], for example in order to start and stop the kernel.
The PS acts as a master from its General Purpose (AXI_GP) port, while the AXI DMA
and HLS IP core act as slaves. Both the PS and PL have access to a common physical
DRAM via a DDR controller. The DMA serves to provide a high-speed communi-
cation facility between the HLS IP core and the DDR controller. The label of the in-
put stream, AXIS_MM2S, denotes a protocol that converts a memory-mapped address
space (the DRAM) into an AXI stream. Essentially, the DMA is what matches the
DDR data to the interface of the custom IP core by transferring the data in a stream-
ing manner from and to the IP core. The second connection between DMA and IP
core, called AXIS_S2MM, reads the output stream and writes it back to the DRAM in
memory-mapped format. The DMA serves as a master over the Zynq’s bidirectional
HP port via a regular AXI protocol. The widths of the AXIS_MM2S and AXIS_S2MM
streams must be a power of two and can be up to 1024 bit; the HP port's maximal
data width, however, is 64 bit. Not shown on the diagram are AXI Interconnect and
SmartConnect IP cores, which up- or downconvert streams as needed to ensure they
FIGURE 5.3: Overview of the System-on-Chip architecture for the ex-ecution of a custom IP core.
have the correct bit widths [23].
5.2.1 Hardware debugging
The debugging of real-world hardware is far less straightforward than that of soft-
ware, although Xilinx provides an IP core called the System Integrated Logic Ana-
lyzer (ILA) to perform in-system debugging of designs on an FPGA after implemen-
tation [31]. In Vivado, interfaces between blocks as well as signals inside IP cores can
be monitored by marking them for debug during the design phase or after synthesis.
The hardware manager in Vivado then allows the user to select probes and set up
triggers on certain values of a signal. For example, a waveform spanning 1024
clock cycles can be captured once the TVALID or TLAST signal of an AXI stream be-
comes true. This technique is very useful for precise timing, latency and bottleneck
analysis, or to figure out why a system is not working at all.
5.2.2 Bandwidth limitations
In this thesis, all IP cores including the accelerators are fixed to a clock period of
10 ns. Given the Zynq HP port’s maximum data width of 64 bit, this means that a
single-way throughput of no more than 800 MB/s per interface is feasible from and
to DRAM. The theoretical upper limit of 1,200 MB/s could be achieved by introduc-
ing a second clock domain and increasing the AXI Interconnect’s clock frequency
between the DMA and PS to 150 MHz (or 200 MHz such as in [25]). Due to tech-
nical difficulties and time constraints1, this opportunity was regarded as less
important and not explored further. Despite the suboptimal configuration, the
conclusions drawn from this chapter can be applied to any other values of such lim-
itations as well. After all, the goal of this chapter is to investigate which method is
best suited to resolve multi-level dataflow problems, rather than to achieve a maxi-
mal frame rate for KinectFusion using every possible available feature.
The only other way to process data at rates higher than 800 MB/s is to use multi-
ple PS-PL ports. The Zynq-7020 SoC has 4 HP ports and one Accelerator Coherency
Port (ACP), the latter of which can be made cache-coherent but is otherwise prac-
tically similar to a HP port, sharing the same maximum throughput. This idea is
explored in the next section. In that case however, the concurrent execution of mul-
tiple kernels is not unbounded either. In an experiment done by [24] using all four
HP ports, the bandwidth of the DDR interface to the external DRAM is determined
to be 4,264 MB/s. Furthermore, the maximum throughput when using all four HP
ports is only 3,333 MB/s, or 78 % of the DRAM bandwidth. Given the speed of 800
MB/s per port in our design, we initially expect few problems to arise unless all
five ports are in use at the same time.
5.3 Independent coexistence of kernels
One way to off-load multiple distinct subroutines of an algorithm to the FPGA is to
simply place them together in the Vivado block design, and connect every streaming
kernel with a separate HP or ACP port to the PS via an AXI DMA. This concept is il-
lustrated in Figure 5.4. Every function can then be called separately by the software
running on the processor, and no direct communication has to occur between the
components. Instead, the data is retrieved from and stored back to the DRAM every
time. This extra overhead and lack of inter-kernel communication forms a drawback in most cases, although there are sometimes good reasons to accept it. Naturally, the
available PS-PL ports will quickly become filled up this way, which places an upper
bound of five coexisting kernels in the case of a Zynq-7020 SoC. A well chosen set
of components to be accelerated whose combined resources still fit on the PL can
nevertheless generate a significant speed-up to the whole system. In addition, the
power efficiency will also be much better than what can be achieved by off-loading
just a single accelerator, in terms of useful computation done per Watt.
1Synthesizing and implementing a block design in Vivado takes on the order of one to several hours, which limits the number of iterations we could perform.
84 Chapter 5. System-level acceleration of multiple kernels
FIGURE 5.4: System architecture when five coexisting kernels are implemented together on the FPGA. By allocating one port for every accelerator, hard constraints on concurrent executions are avoided.
In general, the selection of which subset out of N kernels to off-load to the FPGA
can be stated approximately as the following mathematical optimization problem:
$$
\begin{aligned}
\underset{s_i}{\text{minimize}} \quad & \sum_{i=1}^{N} s_i\, t_{i,\mathrm{FPGA}} + (1 - s_i)\, t_{i,\mathrm{CPU}} \\
\text{subject to} \quad & \sum_{i=1}^{N} s_i\, r_{i,\mathrm{BRAM}} \le 1, \qquad \sum_{i=1}^{N} s_i\, r_{i,\mathrm{DSP}} \le 1, \\
& \sum_{i=1}^{N} s_i\, r_{i,\mathrm{FF}} \le 1, \qquad\;\;\, \sum_{i=1}^{N} s_i\, r_{i,\mathrm{LUT}} \le 1
\end{aligned}
\tag{5.1}
$$
Here, the decision booleans $s_i$ denote whether to execute component $i$ on the FPGA (1) or the CPU (0). The variables $t_i$, measured beforehand, equal the total time spent in each kernel when it is executed on either of these devices. The resource fractions $r_i \in [0, 1]$ denote the resource utilization as a fraction of the total available amount of that type provided by the FPGA. Note that three important assumptions are made:
1. Every component is executed in isolation from any other. The objective function has to be modified to account for concurrent executions and scheduling opportunities; in practice, this happens on a case-by-case basis.
2. The PS-PL communication overhead is zero; one component can start immediately after the other and no waiting has to occur. This is nevertheless a reliable approximation, provided that the individual timings already account for bandwidth bounds and latencies.
3. The resource utilization of combined components, no matter the $s_i$-vector, always equals the sum of the individually off-loaded components. From our experience, these values tend to be off by around 35 % in a favourable way, meaning that the post-implementation total resource utilization is actually less than the sum of the post-implementation resource profiles of single-kernel accelerators. This non-linear scaling is explained by the increased opportunity for resource sharing as more and more functionality enters the FPGA.
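For small $N$, problem (5.1) can be solved exactly by enumerating all $2^N$ decision vectors, which is cheap for the five kernels considered here. The sketch below is illustrative only; the function names and data layout are ours, not part of the thesis toolchain.

```cpp
#include <vector>

struct Kernel {
    double t_cpu, t_fpga;   // measured execution times on each device
    double r[4];            // BRAM, DSP, FF, LUT utilization fractions
};

// Exhaustive search over all 2^N decision vectors s, returning the bitmask of
// kernels to off-load that minimizes total time under the four resource
// constraints of (5.1). Feasible for small N; larger N would call for an
// integer linear programming solver instead.
unsigned best_offload_subset(const std::vector<Kernel>& k) {
    const unsigned N = static_cast<unsigned>(k.size());
    unsigned best_mask = 0;
    double best_time = 1e300;
    for (unsigned mask = 0; mask < (1u << N); ++mask) {
        double time = 0.0, res[4] = {0, 0, 0, 0};
        for (unsigned i = 0; i < N; ++i) {
            if (mask & (1u << i)) {              // s_i = 1: run on FPGA
                time += k[i].t_fpga;
                for (int j = 0; j < 4; ++j) res[j] += k[i].r[j];
            } else {                             // s_i = 0: run on CPU
                time += k[i].t_cpu;
            }
        }
        bool feasible = res[0] <= 1 && res[1] <= 1 && res[2] <= 1 && res[3] <= 1;
        if (feasible && time < best_time) { best_time = time; best_mask = mask; }
    }
    return best_mask;
}
```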
For our research, we intend to investigate the case of accelerating all five components
at once. A sacrifice therefore had to be made regarding the bilateral filter’s complex-
ity: its window size was reduced from 5x5 to 3x3 as the design could otherwise not
fit on the Zynq-7020 FPGA.
Kernel            CPU (ARM Cortex-A9)  HLS report  FPGA (Zynq-7020)  HLS speed-up  Actual speed-up
mm2m_sample       2.70 ms              0.38 ms     0.77 ms           x7.0          x3.5
bilateral_filter  426 ms               0.77 ms     0.78 ms           x552          x544
half_sample       1.82 ms              0.24 ms     0.50 ms           x7.6          x3.7
depth2vertex      7.90 ms              1.01 ms     2.05 ms           x7.8          x3.8
vertex2normal     27.4 ms              1.01 ms     2.06 ms           x27.2         x13.3
Total             467 ms               3.41 ms     6.16 ms           x137          x75.6

TABLE 5.2: Time spent in each kernel as measured on both the PS and PL of the Zedboard. Summing these values assumes that all kernels are executed separately in time, and can be placed side by side onto the same FPGA.
5.3.1 Performance analysis
This section benchmarks the design where IP cores A through E in Figure 5.4 are
filled in by mm2m_sample, bilateral_filter, half_sample, depth2vertex and vertex2normal. The initial test is to engage every accelerator separately in time, although concurrent
executions will be discussed directly afterwards.
Isolated executions
Executing KinectFusion’s accelerated kernels in isolation at a clock period of 10 ns
produces Tables 5.2 and 5.3. With the current architecture, every block except bilateral_filter clearly performs twice as slowly as was estimated by Vivado HLS. This
is logical, since an I/O interface throughput of at least 1,600 MB/s was assumed in
those designs. Recall that mm2m_sample was unrolled by a factor of 8 and half_sample by a factor of 4 in Chapter 4. The effective unroll factors have become 4 and 2 respectively, due to a PS-PL interfacing bottleneck. depth2vertex and vertex2normal work
with 128-bit data points and therefore suffer from a very similar bottleneck, plotted
in Figure 5.5. By comparing the PS-PL interface signals and the DMA streaming sig-
nals, it is revealed that the PL is forced to split up every 128-bit packet into two 64-bit
packets in order to pass the stream of 3D points via the HP port. This can be con-
firmed by looking at the frequency and placement of the 32 zero-padding bits, which
serve the purpose of fitting a 96-bit struct element inside the 128-bit AXI streaming
format. This data width conversion causes the DMA to adapt to the slowest stream
of 64-bit and thus require two clock cycles per communicated element, so that the
initiation interval has effectively doubled from 1 to 2.
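The doubling of the initiation interval follows directly from the number of port beats needed to move one element across the 64-bit interface. A minimal helper (our own, not part of the design) makes this explicit:

```cpp
// Number of HP-port beats (and hence clock cycles) needed per stream element
// when a wide AXI-stream packet must be split to fit the narrower PS-PL
// interface. The accelerator's effective initiation interval grows by this
// factor, since the DMA adapts to the slowest stream.
int beats_per_element(int element_bits, int port_bits) {
    return (element_bits + port_bits - 1) / port_bits;  // ceiling division
}
```

For the 128-bit 3D points over a 64-bit HP port this gives 2 beats per element, matching the observed II of 2; the 96-bit payload padded to 128 bits incurs the same cost.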
Figure 5.4 depicts the architecture that is employed to off-load all five kernels at
once. For mm2m_sample and half_sample, implementations with unrolling factors of 4
and 2 respectively were used, because the computationally stronger variants do not
yield any real-world increase in performance. The performance and communication
bounds (see [62]) are precisely matched to each other this way, so that no bandwidth
Kernel            Input width  Input rate  Output width  Output rate  Elements produced per clock cycle
mm2m_sample       16-bit       800 MB/s    32-bit        400 MB/s     1
bilateral_filter  32-bit       400 MB/s    32-bit        400 MB/s     1
half_sample       32-bit       800 MB/s    32-bit        200 MB/s     0.5
depth2vertex      32-bit       200 MB/s    128-bit       800 MB/s     0.5
vertex2normal     128-bit      800 MB/s    128-bit       800 MB/s     0.5

TABLE 5.3: Realized maximum I/O throughputs that conform to HP port bandwidth bounds. The data widths and elements processed per clock cycle are measured in terms of data units meaningful to KinectFusion (e.g. one depth value), without regard for details involving packed structs.
(A) AXI streams directly attached to the HLS IP core (128-bit wide).
(B) Zynq HP port transferring data read from DRAM (64-bit wide).
FIGURE 5.5: Waveforms produced by the System ILA for the vertex2normal kernel.
is left and no superfluous hardware resources are used up. Coincidentally, this con-
figuration just barely fits on the Zynq-7020 FPGA with a LUT utilization of 88 %. If
every instance were to be executed separately in time, then the total time spent in
these first five kernels would be 6.16 ms. The FPGA therefore has the capacity to
process 162 frames per second, writing all seven temporary outputs to the DRAM
per incoming sensor measurement. Note that this value does not represent a real
FPS of the full system, unless it is assumed that all surrounding components of the
algorithm work sufficiently fast as well.2 While independent execution of kernels on
the FPGA is already a huge improvement over merely using the CPU, opportunities
for parallelism among different blocks exist and will be exploited next.
Frame-level pipelining
Returning to Figure 5.1, it is evident that not all components need to wait for each
other. In particular, once the bilateral filter finishes and its result is written back to
DRAM, there is no reason for either depth2vertex or half_sample1 to delay its start until the other has completed. After all, the ’coexistence’ configuration allows for any degree of concurrent execution to occur thanks to the presence of
independent PS-PL links for each of its components. Taking into account the inter-
kernel data dependencies, Figure 5.6a illustrates how the accelerators’ executions
should be scheduled over time so that every result is available as early as possible.
We again introduce the concepts of iteration latency (IL) and initiation interval (II), de-
fined very similarly to Chapter 3 albeit on a higher level in this context. The timings
are calculated theoretically assuming the speeds of Table 5.3 hold true at all times,
and there exists no delay in-between the execution of multiple kernels or instances.
For example, half_sample1 produces 160 × 120 32-bit elements at an average throughput of 200 MB/s or, equivalently, 1 output value for every 2 clock cycles, so that the estimated timing is $160 \times 120 \times 2 \times 10^{-5} = 0.384$ ms.
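The same estimate applies to any kernel in Table 5.3; a small helper (ours, for illustration) spells out the calculation:

```cpp
// Theoretical kernel timing in ms, given the output resolution, the number of
// clock cycles spent per produced element, and the clock period in ns
// (10 ns throughout this chapter).
double estimated_time_ms(int width, int height, double cycles_per_elem,
                         double clock_ns = 10.0) {
    return width * height * cycles_per_elem * clock_ns * 1e-6;  // ns -> ms
}
```

`estimated_time_ms(160, 120, 2.0)` reproduces the 0.384 ms quoted above, and `estimated_time_ms(320, 240, 2.0)` gives the 1.536 ms expected for a full-resolution pass at 2 cycles per element.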
The realized iteration latency of Figure 5.6a’s execution on the SoC is 5.31 ms.
This is quite close to but still more than the theoretical value of 5.09 ms, indicating
that a bottleneck occurs. Interestingly, the bilateral filter behaves strangely when it
is part of the coexisting configuration. Even when executed in isolation, its timing
is consistently measured to be 0.97 ms instead of 0.78 ms as in the standalone con-
figuration when only one IP core is implemented on the PL. Figure 5.7 shows the
AXI stream signals, confirming that the unexpected issue is present. When taking
this extra latency into account, the increase from 5.09 ms to 5.31 ms is logical.3 Some
aspect of the component being placed on a nearly full FPGA and connected to one
HP port among a fully occupied set of PS-PL ports seems to generate a slowdown,
2For example, it was experimentally determined that reading a new input frame from the SD card takes around 50 to 70 ms, which just barely threatens the real-time constraint of achieving 15 FPS.
3The extra 30 µs can simply be attributed to the summed latencies of individual executions.
(A) Schedule to process a single sensor frame.
(B) Schedule to process multiple sensor frames. Both depth2vertex and vertex2normal determine the smallest possible initiation interval.
FIGURE 5.6: Diagrams depicting how the five kernels should be executed in time if the DDR access speed were unlimited. The rows correspond to accelerators each managing their own DMA and PS-PL port, while the distinct tasks are labelled with resolution levels (0 stands for 320x240, 1 for 160x120 and 2 for 80x60).
FIGURE 5.7: System ILA waveforms for bilateral_filter when it is executed alone, revealing a strange hiccup. The vertical lines are spaced 200 ns apart.
although we are unable to explain precisely where this bottleneck lies (we do not
exclude the possibility of hardware, software or system bugs either). However, this
particular incident does not significantly impact the conclusions we can draw from
our experiments with regards to multi-level dataflow, and we chose to focus on more
important matters rather than to extensively debug the encountered situation. Mea-
surements of all other kernels showed that they behave exactly as expected, and
achieve the same performance results as their standalone counterparts.
The natural extension of the discussed configuration is to set up a pipelined execution of all blocks as in Figure 5.6b. This way, multiple frames can be processed in
an overlapped manner. The initiation interval is $(320 \times 240 + 160 \times 120 + 80 \times 60) \times 2 \times 10^{-5} = 2.02$ ms in theory, but this assumes that the DRAM has no problem dealing with three concurrent streams of 200-800 MB/s to different memory locations at
once, continuously generating around 2 GB/s of traffic in both directions. The re-
alized initiation interval is 2.53 ms (an increase of 21 %), while the iteration latency
is 6.02 ms (an increase of 18 %). As every individual HP or ACP port is requested
to handle only 800 MB/s at most by the PL blocks, the increased II can very likely
be attributed to a bottleneck arising from the high DDR workload. The presence of
a slowdown is confirmed by plotting the half_sample I/O signals in Figure 5.8. The
DMA is clearly sending and receiving burst copies instead of operating at its full
capacity; correspondingly, the streaming interfaces are only active for a fraction of
the time. The increase in IL can be explained for exactly the same reasons. Over-
all, the system is performing around 20 % slower than expected, and the differences
between both relative increases are small enough to be blamed on random measure-
ment noise (our timing method is sufficiently accurate but of course not perfect).
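The theoretical initiation interval quoted above can be reproduced from the three resolution levels and the 2-cycles-per-element rate; the sketch below is ours, using the figures from the text:

```cpp
// Theoretical initiation interval of the frame-level pipeline: the three
// resolution levels each stream their elements at 2 clock cycles per element
// with a 10 ns clock, and successive frames are bounded by the sum over the
// levels (the slowest repeating pattern in Figure 5.6b).
double pipelined_ii_ms() {
    const int levels[3][2] = {{320, 240}, {160, 120}, {80, 60}};
    double cycles = 0.0;
    for (const auto& l : levels)
        cycles += l[0] * l[1] * 2.0;   // 2 cycles per element
    return cycles * 10.0 * 1e-6;       // cycles -> ns -> ms
}
```

This evaluates to 2.016 ms, i.e. the 2.02 ms theoretical II against which the realized 2.53 ms (a 21 % increase) is compared.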
5.4 Task-level pipelining
The basic configuration in Figure 5.2a, linking data streams of multiple components
together, lends itself to an obvious upgrade in the form of task-level pipelining. This
concept allows for the overlapped execution of subsequent kernels, in the sense that
FIGURE 5.8: System ILA waveforms for half_sample in the multi-frame execution. Large-scale pauses and restarts are clearly visible, and occur presumably due to the DDR controller having to operate at full capacity. The vertical lines are spaced 1 µs apart.
every element produced by a given component is immediately processed by the next
component as soon as it becomes available.4 By inserting small channels (often FIFO
buffers) in-between these blocks, it is therefore possible to completely bypass the re-
quirement of having to store large bodies of intermediate data somewhere while the
next component is waiting for its predecessor(s) to finish all their computations. In
practice, the HLS DATAFLOW pragma is the ideal lubricant for this paradigm as it
automates many aspects of its implementation [1], [72]. C/C++ functions as well as
individual for-loops can effectively be chained together to form a single ’superblock’
as if it were one large for-loop, although several restrictions apply. The usage of this
optimization directive to enable linear task-level pipelining is standard practice and will not be covered here; instead, we consider how situations resembling Figures
5.2b and 5.2c could possibly be tackled efficiently using the underlying concept de-
spite its limitations, in order to eventually arrive at two candidate solutions for Fig-
ure 5.1. The multi-level dataflow principles discussed in the following paragraphs
can essentially be applied in two distinct manners:
1. In the block design of Vivado, where standard task-level pipelining can be
generated naturally by connecting subsequent kernels directly together.
2. In Vivado HLS, where the dataflow directive stands central but special mea-
sures have to be taken to reconcile its operation with conflicting requirements.
A comparison will then be made of all reviewed techniques in Section 5.5.
4We have encountered three forms of pipelining at this point, so that it is appropriate to emphasize the differences between them:
1) HLS pipelining concerns subsequent iterations of a loop and happens at a low level, i.e. that of flip-flops and hardware blocks.
2) Frame-level pipelining concerns subsequent frames of a video fragment, and happens at the much higher level of complete accelerators residing on the PL to be controlled by the PS.
3) Task-level pipelining concerns subsequent components within one accelerator. It also happens at a different level than regular HLS pipelining, yet is distinct from the previous concept.
5.4.1 Intermediate output aggregation
Extracting data streams from in-between two blocks breaks the single-producer,
single-consumer principle that is fundamental to the operation of task-level dataflow
in Vivado HLS [72]. Even a basic stream construct does not allow multiple reads of
the same element; instead, every call to read() also consumes that element so that
it can never be retrieved again unless a copy of the data is preserved somehow. AXI
stream broadcast IP cores or custom stream duplicators in HLS provide a partial
solution to this problem, although the question still remains of how to redirect the
duplicated data either to a temporary storage, or directly to the input side of the
next component where it is needed. Since the latter case requires all kernels to exe-
cute together in a synchronized manner, resource sharing among similar variants of
a kernel cannot really be exploited.5 Therefore, only the former case is focused on,
as this will allow all instances of a kernel to run separately in time while still taking
advantage of task-level pipelining.
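Conceptually, a stream duplicator restores the single-producer, single-consumer property by performing the one destructive read itself and fanning the element out. In the sketch below (ours; plain vectors stand in for hls::stream objects), one copy feeds the next kernel and one is routed towards DRAM:

```cpp
#include <utility>
#include <vector>

// Software sketch of an AXI4-Stream broadcast block: every element is consumed
// exactly once from the input stream (a destructive read, like
// hls::stream::read()) and forwarded to two outputs -- one feeding the next
// kernel in the chain, one routed back to DRAM through a one-way DMA.
std::pair<std::vector<int>, std::vector<int>>
broadcast_stream(std::vector<int> in) {
    std::vector<int> to_next, to_dram;
    while (!in.empty()) {
        int v = in.front();
        in.erase(in.begin());      // the single destructive read
        to_next.push_back(v);      // branch 1: next kernel
        to_dram.push_back(v);      // branch 2: temporary storage in DRAM
    }
    return {to_next, to_dram};
}
```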
In Vivado’s block design, Figure 5.2b can be realized as Figure 5.9a (four IP cores
are shown here, but this amount can again vary up to five on the Zynq-7020 SoC).
The AXI4-Stream Broadcast IP core forwards a single input stream to both outputs,
so that one of them can return via a one-way AXI DMA to DRAM. DDR bottlenecks
might arise, which is discussed in Section 5.4.3. A second solution method is to
create a top function in HLS calling all subfunctions and insert the HLS dataflow
directive. In order not to cause a single-producer, single-consumer violation, im-
portant modifications must however be introduced. If a single combined output
stream is desired, it does not make sense to bypass the blocks in the middle and
somehow recombine the differently phased streams at the end. Bypassing tasks is
even downright impossible with the pragma enabled [72], and various attempts to
do so anyway have been found to produce deadlocks during C/RTL cosimulation.
A feasible workaround that we have conceived is depicted in Figure 5.9b. Every
subkernel ought to attach its own output onto the existing stream to prevent the
data from being lost forever. This way, the elements grow wider as the computation
progresses. In the end, all intermediate results are stored in one wide output stream and written back to DRAM. Remark that the technical implementation of this solution is more burdensome than the first method: Stencil kernels in particular now
have to keep all aggregated data inside their line buffers, in order to ensure that the
information to be passed through remains correctly synchronized in time with the
newly computed data. The utilization of BRAM and registers slightly increases as
a result, but this should not pose a problem considering the line buffers’ relatively
small dimensions.

5This statement is analogous to the fact that loop unrolling, a.k.a. processing element duplication, usually increases resource utilization by a factor equal to the resulting gain in performance.
(A) In Vivado’s block design: all kernels are kept fully separate during the HLS design phase. Unlabelled white arrows belong to the AXI4-stream protocol.

(B) Single HLS IP core combining all subkernels via dataflow. Inter-kernel results are accumulated into an increasingly larger stream (depicted here as multiple arrows). The dotted lines represent conceptual pass-through connections between input and output.

FIGURE 5.9: Two possible solutions for intermediate output aggregation (Figure 5.2b).
5.4.2 Multi-modal execution
The problem of multi-modal execution is defined in this context as how to efficiently
design the architecture for a chain of kernels whose instances have varying function-
ality and/or dimensionality while maximizing the degree of area utilization across
these modes. Figure 5.2c could trivially be implemented as three separate accelera-
tors. While this easily allows for the concurrent execution of different instances, it
will prevent Vivado from exploiting any degree of inter-modal overlap on the hard-
ware level during synthesis and implementation. Instead, we propose the usage
of multi-modal blocks such as in Figure 5.10a. Another distinction is made here
between kernels whose mode can be changed by setting parameters via their AXI
slave interface, and kernels that are more fundamentally different so that a custom
HLS stream selection block is inserted before and after them instead. Note that the
latter case, exemplified by IP cores Bi, does not conform to the area sharing princi-
ple but allows for completely unrelated kernels to exist next to each other instead.
The ability for the software programmer to select which block to activate in the cur-
rent chain might be required in some parts of the multi-modal execution process.
In KinectFusion for example, variants of both depth2vertex and vertex2normal exist
across all levels, although mm2m_sample and bilateral_filter are present in the highest
resolution level only. Section 5.4.3 will therefore insert stream switching components
to deal with the selection between the latter two kernels and half_sample.
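In software terms, a multi-modal block reduces to a mode parameter selecting between datapaths per invocation, much like a PS-controlled AXI-Lite register does in the block design. The arms below are placeholders, not the real KinectFusion kernels:

```cpp
#include <vector>

// Illustrative multi-modal block: one datapath whose behaviour is selected per
// invocation by a mode parameter. In hardware, the mode would be written by
// the PS via the block's AXI slave interface before starting the accelerator.
std::vector<int> multimodal_pass(const std::vector<int>& in, int mode) {
    std::vector<int> out;
    for (int v : in) {
        if (mode == 0) out.push_back(v * 2);   // variant A (e.g. level-0 path)
        else           out.push_back(v + 1);   // variant B (e.g. level-1 path)
    }
    return out;
}
```

Because both variants share one loop body, synthesis can share the surrounding control logic and I/O infrastructure between the modes.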
Vivado HLS again provides an opportunity to resolve the multi-modal problem
in an earlier stage as well. The impossibility of conditional task execution is another
constraint of the dataflow optimization directive [72], although case selection within a loop is certainly not forbidden. This idea is illustrated in Figure 5.10b. Multiple
variants of kernels can be activated by controlling the mode parameter, and the free-
dom of if-else case switching allows for completely different functionality to coexist
within one HLS function as well. Our hypothesis is that this method decreases total
area utilization even further compared to the first solution, since resource sharing
across modes (for blocks that were previously distinct) can already occur during the
HLS scheduling and binding phases.
5.4.3 Application to KinectFusion
Having investigated how to tackle the two challenges constituting the multi-level
dataflow problem, its application to Figure 5.1 is considered. First, a necessary in-
termediate step is to determine which modes should correspond to which blocks
in order to reach an efficient configuration of combined kernels with respect to task-
level pipelining. Stated more generally, data dependencies between components can
be modelled as a directed graph. The task at hand is then to find an optimal set of
(A) In Vivado’s block design, the components can be configured to instantiate different kernel variants either by setting control signals, or by routing the switch blocks.

(B) In HLS, the differentiation between several modes is done inside the loop bodies.

FIGURE 5.10: Two possible solutions for multi-modal execution (Figure 5.2c).
FIGURE 5.11: Three different sets of paths (depicted as large arrows) that connect components to combine using task-level pipelining. The time for one path is estimated from the slowest block inside that path, and the paths should be executed separately in time to enable resource sharing across different modes.
paths, each consisting of pipelined components and corresponding to a certain exe-
cution time, so that all nodes are covered and the total sum of all timings is minimal.
Figure 5.11 shows three different overlays in the KinectFusion use case. While differ-
ent modes need not necessarily correspond to different resolution levels, in this case
it seems that the best way forward is the leftmost configuration. We now review two
practical ways in which the intermediate output aggregation and multi-modal exe-
cution techniques can be applied to KinectFusion: one way is via the block design,
and the other is via HLS itself. It is expected that both methods will generate better
results compared to the straightforward side-by-side placement of all accelerators
that was discussed in Section 5.4.
Block-level dataflow
Figure 5.12 depicts the architecture that combines elements of both Figures 5.9a and
5.10a. Note that the bit widths correspond to packed structs and the stream ele-
ments do not always correspond to meaningful units in KinectFusion. While this is
not always the case in general, it so happens that at every stage within this pipeline
of blocks, intermediate streams have to be extracted and redirected to DRAM via an
AXI DMA. The outputs of mm2m_sample and depth2vertex are a priori required by
the temporary storage in Figure 5.1, although this does not hold for the outputs of
bilateral_filter and half_sample. However, the latter data is needed for the next level of
kernels down the resolution pyramid, so that it inevitably needs to be stored away
as well. The processing element duplication factor of mm2m_sample was reduced
from 4 to 2 because the bottleneck of the first level (the leftmost path in Figure 5.11)
now resides with the two last kernels due to their 128-bit stream size. As a result,
FIGURE 5.12: System architecture that handles the multi-level dataflow challenge of KinectFusion’s first five kernels (see Figure 5.1) completely within the Vivado block design, leaving the HLS IP cores unchanged. AXI-Lite control signals are omitted for clarity, and the bottleneck-inducing streams are marked with a red data width label.
communication and computation bounds are now matched for all three levels.
Figure 5.13 depicts which components should be started at which level for every frame, and gives an indication of how long they take. The switch blocks
should of course be controlled accordingly. Every kernel only exists once in the
hardware so that no overlapped execution can occur, causing the iteration latency to
be equal to the initiation interval. The measured values are II = IL = 2.10 ms, which is very close to the theoretical value of $(320 \times 240 + 160 \times 120 + 80 \times 60) \times 2 \times 10^{-5} = 2.02$ ms. Despite the total throughput of data that is written to DRAM being quite
high, there is only one input stream to the whole FPGA. In contrast to the coexisting
FIGURE 5.13: Schedule to process incoming sensor frames using the improved accelerators. Due to the application of task-level pipelining, all subcomponents now adapt to the slowest link in the chain, which is formed by bandwidth limitations.
configuration where the input of every block had to be read from DRAM, the max-
imum throughput from PS to PL is 800 MB/s here because all intermediate results
are passed directly to subsequent components. We suspect that this lower value
removes the DDR bottleneck that was present previously. The average resource utilization of this configuration is 45 %, or 7 % less than the previous architecture with coexisting kernels, since less communication infrastructure is present.
HLS-level dataflow
Both intermediate output aggregation and multi-modal kernels can already be intro-
duced in the HLS design as well, conceptually illustrated in Figures 5.9b and 5.10b.
Merging these techniques for KinectFusion yields Listing 5.1, where many details
including fixed-float conversion, variable loop bounds and TLAST signal handling
are omitted for clarity. Opportunities for resource sharing can now be taken ad-
vantage of more thoroughly, e.g. by fusing the bilateral_filter and half_sample kernels
together as much as possible. The window sizes of these Stencil kernels are 3x3 and
2x2 respectively, the biggest of which is determined by the bilateral filter. By using
one shared window for both routines, the effect of multi-modality is consequently
postponed to the actual Stencil computation given such a window filled with data.
The overarching HLS IP core can be implemented as one big accelerator on the
FPGA, shown in Figure 5.3. The resulting functionality is very similar to that of the
block-level dataflow design, except that the complex dataflow challenges are now
taken care of at a different level. To process multiple frames, the schedule depicted
in Figure 5.13 also applies to this case. However, measurements indicate that the ini-
tiation interval and iteration latency have increased to 4.13 ms. This unfortunate result is explained by the fact that only one AXI DMA is used to retrieve all output data.
The width of one element is 256 bits as it must contain at least two depth values of 32
bits each and two 3D points of 96 bits each. The 64-bit HP port at the PS-PL interface
typedef struct {
    float   mm_out;   // 4 bytes; empty for level >= 1
    float   bf_out;   // 4 bytes; hs_out for level >= 1
    point_t dv_out;   // 12 bytes
    point_t vn_out;   // 12 bytes
} agg_t;

int mm_through_vn(hls::stream<agg_t>& stream_out,
                  hls::stream<int>& stream_in, int level) {
#pragma HLS DATAFLOW
    hls::stream<agg_t> tmp1, tmp2, tmp3;
    // stream_in -> tmp1
    for (...) {
        if (level == 0) { // mm2m_sample (Map)
            ...
        } else {          // pass through (Map)
            ...
        }
    }
    // tmp1 -> tmp2; both kernels use a shared memory window
    for (...) {
        if (level == 0) { // bilateral_filter (Stencil)
            ...
        } else {          // half_sample (Stencil)
            ...
        }
    }
    // tmp2 -> tmp3
    for (...) {
        // depth2vertex (Map) ...
    }
    // tmp3 -> stream_out
    for (...) {
        // vertex2normal (Stencil) ...
    }
}

LISTING 5.1: Code snippet summarizing how the multi-level dataflow problem is to be solved within Vivado HLS.
is thus forced to chop up the accumulated data into four smaller packets, taking four
clock cycles per element to send them to DRAM. Another drawback of the HLS-level
dataflow solution is that all data is now returned in interleaved format and must be
deinterleaved by the PS in order to obtain the same separated array structures as
in the block-level dataflow solution. The FPGA cannot perform deinterleaving into
one output stream though, as this would again bring us back to unrealistically large
memory requirements. We suspect that deinterleaving is however not strictly re-
quired for all use cases.
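The PS-side deinterleaving step can be sketched as follows. The field names mirror the agg_t struct of Listing 5.1, with point_t reduced to a single float for brevity; this is an illustrative model of the CPU post-processing, not the actual thesis code.

```cpp
#include <vector>

// One aggregated record as returned by the HLS-level dataflow accelerator;
// point_t is collapsed to a single float here for the sake of the sketch.
struct AggRecord { float mm_out, bf_out, dv_out, vn_out; };

// The separate arrays that the rest of KinectFusion expects.
struct Separated { std::vector<float> mm, bf, dv, vn; };

// The CPU walks the interleaved stream once and scatters each field into its
// own array, undoing the aggregation performed on the FPGA.
Separated deinterleave(const std::vector<AggRecord>& stream) {
    Separated s;
    for (const AggRecord& r : stream) {
        s.mm.push_back(r.mm_out);
        s.bf.push_back(r.bf_out);
        s.dv.push_back(r.dv_out);
        s.vn.push_back(r.vn_out);
    }
    return s;
}
```

This pass is linear in the frame size, so the main cost on the PS is the extra memory traffic rather than the computation itself.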
Lastly, we give a remark on the average resource usage of 35 %, which is 10 %
lower than for the block-level counterpart. This fact can be attributed to the follow-
ing factors, albeit by an unknown weight for each item:
• The reduction in communication infrastructure. Only one AXI DMA is present
in hardware instead of four or five.
• The intrinsically higher degree of resource sharing by making blocks multi-
modal at an earlier stage in the design process. This is Section 5.4.2’s hypothe-
sis that we wanted to test.
• The improved handling of data type conversions.
All streams exposed to the PS are represented in floating point format by princi-
ple so that the CPU can understand their content. The block-level variant contains
the same IP cores as the very first configuration of coexisting kernels. This means
however that all intermediate inputs and outputs consist of floating point numbers,
even those in-between multiple blocks, causing some redundant data type conversions to occur. On the other hand, the HLS-level dataflow architecture has to perform fewer conversions in total thanks to the following design choice: inter-kernel
streams are left unconverted so that subsequent kernels do not have to unnecessarily
re-convert data from floating point to fixed-point representation. The hypothesis of a
fundamental improvement of the HLS-level configuration over its block-level coun-
terpart in terms of resource sharing therefore remains plausible but unconfirmed,
and should be tested more strictly by comparing architectures where such interfer-
ence due to nuisance variables is not present. However, in this particular case it
still holds that the resulting design is more hardware-efficient by 10 % on average
thanks to the engagement of the discussed extra opportunity for avoiding unneces-
sary computations.
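To make the conversion argument concrete, the following Python sketch models both strategies with a toy 16-bit Q8.8 fixed-point format (the scaling, the kernel body, and the function names are illustrative assumptions; the actual designs use per-kernel word lengths). It contrasts converting at every kernel boundary with converting only at the PS-facing endpoints.

```python
SCALE = 1 << 8  # toy Q8.8 fixed-point format (illustrative assumption)

def to_fixed(x):
    return round(x * SCALE)

def to_float(q):
    return q / SCALE

def kernel(q):
    """Placeholder kernel operating on fixed-point data."""
    return q + to_fixed(0.5)

def run_block_level(x, n_kernels):
    """Block-level style: every inter-kernel stream is exposed as float,
    so each kernel performs one float->fixed and one fixed->float pass."""
    conversions = 0
    for _ in range(n_kernels):
        q = to_fixed(x); conversions += 1
        q = kernel(q)
        x = to_float(q); conversions += 1
    return x, conversions

def run_hls_level(x, n_kernels):
    """HLS-level style: convert once on entry and once on exit;
    inter-kernel streams stay in fixed-point representation."""
    q = to_fixed(x)  # single entry conversion
    for _ in range(n_kernels):
        q = kernel(q)
    return to_float(q), 2  # entry + exit conversions

y_blk, conv_blk = run_block_level(1.0, 5)
y_hls, conv_hls = run_hls_level(1.0, 5)
# Both paths compute the same result, but with 10 versus 2 conversion passes
# for a chain of five kernels.
```

In hardware, each avoided conversion pass corresponds to float-to-fixed or fixed-to-float logic that no longer has to be instantiated, which is consistent with the lower average resource usage observed above.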
5.5 Discussion
This chapter explored several architectures for an embedded system that off-loads five distinct KinectFusion kernels at once. The complex datapath required us to solve two related problems along the way: the first concerns the retrieval of data streams from in-between functional blocks, and the second involves the efficient exploitation of the algorithm's multi-level nature. Most importantly, three implementations were developed. A summary of how each configuration solves these two problems follows:
• Independent coexistence of kernels. In this architecture, all kernels are implemented as separate accelerators. As such, there is little need to worry about dataflow, because every stream is immediately written back to DRAM. No task-level pipelining is employed, so data dependencies are the only reason kernels have to wait for each other. As soon as an output becomes available, it is stored permanently in DRAM; any number of other kernels that need the data can therefore read it without issues. The blocks themselves are multi-modal by design. Frame-level pipelining can be achieved by scheduling the execution of the different components and modes efficiently over time.
• Block-level or HLS-level dataflow. Next, by introducing task-level pipelining, we proposed a family of two configurations that insert small buffers in-between kernels rather than imposing on them the burden of passing data via DDR every time. This concept can be applied at two levels: either in the block design or in the HLS top function with a dataflow pragma. Calculated results that are needed in later stages of the algorithm, yet would be lost without corrective measures, are either redirected via a separate AXI DMA core (block-level aggregation) or accumulated throughout the chain of components until the final output stream is reached (HLS-level aggregation). To switch across multiple resolution levels, stream-switching IP cores are inserted at the block level, while hybrid kernels are implemented virtually through if-else case switching at the HLS level. No frame-level pipelining is possible in this architecture, because the accelerator is built to be executed in an indivisible manner by design.
Our findings in this chapter related to multi-level execution closely match those of Boikos et al. [3], who present an implementation of semi-dense SLAM on FPGA in which the units also support multiple data rates in addition to being multi-modal. The authors confirm our conclusion that following the single-producer, single-consumer principle and reusing hardware units while incorporating adjustable processing paths leads to efficient designs. They did not, however, disclose precisely at which
Configuration        | Initiation interval | Iteration latency | Frame rate | BRAM [%] | DSP [%] | FF [%] | LUT [%]
Coexistence          | 2.53 ms             | 6.02 ms           | 395 FPS    | 24       | 46      | 48     | 88
Block-level dataflow | 2.10 ms             | 2.10 ms           | 476 FPS    | 25       | 46      | 40     | 69
HLS-level dataflow   | 4.13 ms*            | 4.13 ms*          | 242 FPS*   | 16       | 51      | 25     | 48

TABLE 5.4: Comparison of timing and resource profiles after implementing mm2m_sample through vertex2normal as separate accelerators versus applying both discussed multi-level dataflow techniques.
level this multiplexing between different pipelined operation paths occurs; the discernment and evaluation of this aspect, we believe, is an important contribution of our research.
Lastly, power usage was not taken into account in this chapter. While Vivado does offer detailed information about the static and dynamic power consumption of post-implementation FPGA designs, it is much more cumbersome to obtain a holistic view of the full heterogeneous system's energy consumption (which is what actually matters). Methods in the literature for measuring the power usage of embedded processing systems are often ad hoc or not described at all, leading us to leave this aspect out of scope.
5.5.1 Comparison of timing and resource profiles
Our best achieved performance metrics are listed in Table 5.4. The block-level data-
flow configuration clearly dominates the coexistence configuration both in terms of
speed and resource usage, which is a positive result. This can be explained by the
fact that only four AXI DMAs are present on the FPGA in configuration 2, three of
which are one-way. In contrast, the first configuration has five two-way DMAs. The
third configuration, HLS-level dataflow, suffers from a low bandwidth bound. Another drawback is that the output stream interleaves all accumulated data (which is why its timings are marked with an asterisk). Note that configurations 2 and 3 can be seen as two ends of a spectrum, since only one DMA is used in the latter variant. If two DMAs were instead employed to write the wide output stream back to DRAM, throughput might roughly double, bringing the performance of both configurations on par. We therefore propose investigating
hybrid block- and HLS-level dataflow architectures as a possible direction for future
research.
Chapter 6
Conclusions and future work
The goal of this master's dissertation was, on the one hand, to implement KinectFusion on the Zynq-7020 SoC and, on the other hand, to devise a set of workable guidelines for implementing similar algorithms and kernels. Throughout our research, it
quickly became apparent that the scope of full FPGA acceleration would have to be
limited to just a subset of all kernels.
First, a methodology was constructed in Chapter 3 that enables the designer to
correctly handle a range of parallel patterns often found in 2D image processing ap-
plications. Techniques that were elaborated and exemplified include pipelining, I/O
streaming, line buffering, array partitioning and scratchpad memories. The HLS re-
port summaries, performance and resource views were pinpointed as indispensable
tools when applying these procedures. Detailed investigations were also made with
respect to the impact of different data types on hardware utilization, and to how Vivado HLS sometimes creates extra overhead in the design by, for example, rounding up memory sizes to the next power of two.
Next, Chapter 4 describes how eight KinectFusion kernels were sped up significantly using the discussed concepts. It also illustrates that these methods should not be applied indiscriminately, as pitfalls might otherwise lead to suboptimal designs. In addition, further opportunities for optimization were discovered that required deeper knowledge of the application itself. Both of these observations led us to conclude that increasing the degree of automation beyond HLS compilation might adversely affect the quality of the resulting design.
Afterwards, a step back was taken in Chapter 5 to gain a system-level overview of the whole application. The first half of KinectFusion was fully off-loaded to the FPGA, and comparisons were made among different methods to reconcile its complex dataflow with task-level pipelining and the accumulation of intermediate streams. The challenge consisted of staying within the bounds of the FPGA's capabilities while achieving the desired functionality without sacrifices. The best configuration with respect to performance was found to arise from composition at the Vivado
Kernel           | ARM CPU  | HLS estimate | FPGA single-kernel | FPGA multi-kernel
mm2m_sample      | 2.70 ms  | 0.38 ms      | 0.78 ms            | 2.10 ms (first five combined)
bilateral_filter | 426 ms   | 0.77 ms      | 0.78 ms            |
half_sample      | 1.82 ms  | 0.24 ms      | 0.50 ms            |
depth2vertex     | 7.90 ms  | 1.01 ms      | 2.05 ms            |
vertex2normal    | 27.4 ms  | 1.01 ms      | 2.06 ms            |
track            | 125.7 ms | 23.0 ms      |                    |
reduce           | 27.7 ms  | 3.31 ms      |                    |
integrate        | 1236 ms  | 16.2 ms      |                    |
raycast          | 1294 ms  |              |                    |

TABLE 6.1: Time spent in each kernel when KinectFusion is executed on either the ARM Cortex-A9 CPU or the Xilinx Zynq-7020 FPGA of the embedded SoC. The multi-kernel timing of 2.10 ms covers mm2m_sample through vertex2normal together.
block design level, while the least hardware resources seem to be used if all func-
tionality is combined within one large HLS IP core instead.
For convenience, the timings obtained throughout this dissertation are summarized in Table 6.1. The HLS column stems from Chapter 4, while the FPGA columns originate from Chapter 5. HLS reports indicated a median speed-up of ×8.10 across the first eight kernels, and FPGA executions revealed an actual speed-up of ×222 for the first five kernels combined. Dividing the sum of the first eight CPU timings by the sum of all eight HLS timings yields a ratio of ×40.4. This value can be treated as an estimated holistic (i.e. weighted average) speed-up factor, comparing regular execution on the CPU with the HLS estimates after optimizing all eight kernels.
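The reported figures can be reproduced directly from the timings in Table 6.1. The short sketch below recomputes the median per-kernel HLS speed-up, the combined speed-up of the first five kernels on the FPGA, and the holistic ×40.4 ratio:

```python
from statistics import median

# Timings in milliseconds, taken from Table 6.1.
cpu = [2.70, 426, 1.82, 7.90, 27.4, 125.7, 27.7, 1236]  # first eight kernels, ARM CPU
hls = [0.38, 0.77, 0.24, 1.01, 1.01, 23.0, 3.31, 16.2]  # HLS estimates
multi_kernel = 2.10  # first five kernels combined on the FPGA

median_speedup = median(c / h for c, h in zip(cpu, hls))  # ~8.10
fpga_speedup = sum(cpu[:5]) / multi_kernel                # ~222
holistic_speedup = sum(cpu) / sum(hls)                    # ~40.4
```

The holistic ratio is a weighted average in the sense that each kernel contributes proportionally to its CPU execution time, which is why the slow kernels (bilateral_filter, integrate) dominate the result.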
Finally, a remark on programmability is in order. Throughout this thesis, we have experienced how FPGAs remain notably difficult to program. Nonetheless, a positive evolution is indisputably present: thanks to the advances in HLS, familiarity with low-level hardware details such as propagation delays and the architecture of basic logic elements is no longer required, which stands in stark contrast to a decade ago [73]. However, in addition to the unique duality and high-dimensional constraints of designing heterogeneous CPU-FPGA systems described in the introduction, the existing toolchains were not found to be bug-free. We believe these facts pose challenges to the accessibility and popularity of FPGA hardware design, and efforts towards improving this workflow can only be encouraged.
6.1 Future work
From a practical standpoint, the most drastic speed-up was achieved for bilateral_filter,
although integrate and raycast are equally (if not more) important candidates. A clear
direction for future research hence presents itself: the HLS design of raycast ought to be investigated in detail. Chapter 4 already elucidated several reasons why its acceleration is bound to be highly involved, but no hard statements can be made until it is attempted at some point.1
Second, Table 4.2 made it clear that not all kernels will fit together on a low-end Zynq-7020 FPGA, judging by their sheer resource usage alone. Opportunities involving higher-end FPGAs should therefore be researched, and/or systems with multiple FPGAs placed in cascade. Perhaps an embedded GPU could be used for raycast, should this last step be deemed unfit for FPGAs after all. One advantage
of using a more expensive FPGA is that clock periods can be reduced further be-
low 10 ns, which would lead to even better performance results in addition to being
able to accommodate a larger number of algorithmic components at once. Another
advantage is their larger internal memory; this extra space makes the local caching
strategies for random data access discussed in Section 3.2.4 more appealing.
Third, the block-level and HLS-level architectures for multi-kernel acceleration explored in Chapter 5 do not span the full design space. Treating them as two ends of a spectrum, a mixture of both concepts could be devised so as to hopefully combine the best of both worlds in terms of timing and resources. In addition, the effect on hardware utilization of moving towards one extreme or the other should be studied more closely, as some uncontrolled variables were present in our experiments, making the results slightly less reliable. Lastly, the dataflow techniques should also be evaluated more extensively by applying them to the remaining two relevant kernels, track and reduce.
1 Several paths can be undertaken here, including a fundamental transformation of the algorithm so that less image data is used overall. However, this would turn the fully dense SLAM application into a semi-dense variant, which might not be desirable with respect to preserving the quality of the reconstruction and localization. Furthermore, recent solutions for semi-dense SLAM already exist [3], [42], [43], although they are unrelated to KinectFusion.
Bibliography
[1] Xilinx, Vivado Design Suite User Guide: High-Level Synthesis v2018.2, 2018. [On-
line]. Available: https://www.xilinx.com/support/documentation/sw_
manuals/xilinx2017_4/ug902-vivado-high-level-synthesis.pdf.
[2] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid,
and J. J. Leonard, “Past, present, and future of simultaneous localization and
mapping: Toward the robust-perception age”, IEEE Transactions on Robotics,
vol. 32, no. 6, pp. 1309–1332, 2016, ISSN: 15523098. DOI: 10.1109/TRO.2016.
2624754. arXiv: 1606.05830v4.
[3] K. Boikos and C.-S. Bouganis, “A Scalable FPGA-based Architecture for Depth
Estimation in SLAM”, Applied Reconfigurable Computing, 2019. arXiv: 1902.04907. [Online]. Available: http://arxiv.org/abs/1902.04907.
[4] R. Nane, V. M. Sima, C. Pilato, J. Choi, B. Fort, A. Canis, Y. T. Chen, H. Hsiao, S.
Brown, F. Ferrandi, J. Anderson, and K. Bertels, “A Survey and Evaluation of
FPGA High-Level Synthesis Tools”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 35, no. 10, pp. 1591–1604, 2016, ISSN:
02780070. DOI: 10.1109/TCAD.2015.2513673.
[5] R. A. Newcombe, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J.
Shotton, S. Hodges, and A. Fitzgibbon, “KinectFusion: Real-Time Dense Sur-
face Mapping and Tracking”, Tech. Rep., 2011.
[6] B. Bodin, H. Wagstaff, S. Saeedi, L. Nardi, E. Vespa, J. H. Mayer, A. Nisbet, M.
Luján, S. Furber, A. J. Davison, P. H. J. Kelly, and M. O’Boyle, “SLAMBench2:
Multi-Objective Head-to-Head Benchmarking for Visual SLAM”, 2018. arXiv:
1808.06820. [Online]. Available: http://arxiv.org/abs/1808.06820.
[7] M. Abouzahir, A. Elouardi, R. Latif, S. Bouaziz, and A. Tajer, “Embedding
SLAM algorithms: Has it come of age?”, Robotics and Autonomous Systems,
2018, ISSN: 09218890. DOI: 10.1016/j.robot.2017.10.019.
[8] J. Engel, J. Sturm, and D. Cremers, “Semi-dense visual odometry for a monocu-
lar camera”, Proceedings of the IEEE International Conference on Computer Vision,
pp. 1449–1456, 2013. DOI: 10.1109/ICCV.2013.183.
[9] L. Nardi, B. Bodin, M. Z. Zia, J. Mawer, A. Nisbet, P. H. Kelly, A. J. Davi-
son, M. Luján, M. F. O’Boyle, G. Riley, N. Topham, and S. Furber, “Introduc-
ing SLAMBench, a performance and accuracy benchmarking methodology for
SLAM”, in Proceedings - IEEE International Conference on Robotics and Automation, vol. 2015-June, 2015, pp. 5783–5790, ISBN: 9781479969234. DOI: 10.1109/
ICRA.2015.7140009. eprint: 1410.2167.
[10] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark
for the evaluation of RGB-D SLAM systems”, IEEE International Conference on Intelligent Robots and Systems, pp. 573–580, 2012, ISSN: 21530858. DOI: 10.1109/
IROS.2012.6385773.
[11] R. Mur-Artal and J. D. Tardos, “ORB-SLAM2: An Open-Source SLAM System
for Monocular, Stereo, and RGB-D Cameras”, IEEE Transactions on Robotics,
vol. 33, no. 5, pp. 1255–1262, 2017, ISSN: 1552-3098. DOI: 10.1109/TRO.2017.
2705103. arXiv: 1610.06475. [Online]. Available: http://ieeexplore.ieee.
org/document/7946260/.
[12] J. Engel, T. Schöps, and D. Cremers, “LSD-SLAM: Large-Scale Direct Monocu-
lar SLAM”, European Conference on Computer Vision, 2014, ISSN: 00201693. DOI:
10.1016/S0020-1693(00)81721-1.
[13] T. Whelan, R. F. Salas-Moreno, B. Glocker, A. J. Davison, and S. Leutenegger,
“ElasticFusion: Real-time dense SLAM and light source estimation”, International Journal of Robotics Research, vol. 35, no. 14, pp. 1697–1716, 2016, ISSN:
17413176. DOI: 10.1177/0278364916669237.
[14] V. A. Prisacariu, O. Kähler, S. Golodetz, M. Sapienza, T. Cavallari, P. H. S. Torr,
and D. W. Murray, “InfiniTAM v3: A Framework for Large-Scale 3D Recon-
struction with Loop Closure”, 2017. arXiv: 1708.00783. [Online]. Available:
http://arxiv.org/abs/1708.00783.
[15] Y. Bai, M. Alawad, R. DeMara, and M. Lin, “Optimally Fortifying Logic Relia-
bility through Criticality Ranking”, Electronics, vol. 4, no. 1, pp. 150–172, 2015.
DOI: 10.3390/electronics4010150.
[16] I. Kuon and J. Rose, “Measuring the gap between FPGAs and ASICs”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26,
no. 2, pp. 203–215, 2007, ISSN: 02780070. DOI: 10.1109/TCAD.2006.884574.
[17] D. Koch, F. Hannig, and D. Ziener, FPGAs for software programmers. 2016, pp. 1–
327, ISBN: 9783319264080. DOI: 10.1007/978-3-319-26408-0.
[18] S. Asano, T. Maruyama, and Y. Yamaguchi, “Performance comparison of FPGA,
GPU and CPU in image processing”, FPL 09: 19th International Conference on Field Programmable Logic and Applications, pp. 126–131, 2009. DOI: 10.1109/
FPL.2009.5272532.
[19] Tedway, What are FPGAs and Project Brainwave? - Azure Machine Learning service,
2019. [Online]. Available: https://docs.microsoft.com/en- us/azure/
machine-learning/service/concept-accelerate-with-fpgas.
[20] J. Fowers, G. Brown, P. Cooke, and G. Stitt, “A performance and energy com-
parison of FPGAs, GPUs, and multicores for sliding-window applications”,
p. 47, 2012. DOI: 10.1145/2145694.2145704.
[21] K. Rafferty, D. Crookes, F. Siddiqui, T. Deng, R. Woods, U. Minhas, and S.
Amiri, “FPGA-Based Processor Acceleration for Image Processing Applica-
tions”, Journal of Imaging, vol. 5, no. 1, p. 16, 2019. DOI: 10.3390/jimaging5010016.
[22] Xilinx, Zynq-7000 SoC: Technical Reference Manual v1.12.2, 2018. [Online]. Avail-
able: https://www.xilinx.com/support/documentation/user_guides/
ug585-Zynq-7000-TRM.pdf.
[23] ——, Vivado Design Suite: AXI Reference Guide v4.0, 2017. [Online]. Available:
https://www.xilinx.com/support/documentation/ip_documentation/axi_
ref_guide/latest/ug1037-vivado-axi-reference-guide.pdf.
[24] B. J. Svensson, “Exploring OpenCL Memory Throughput on the Zynq”, 2016.
[25] E. H. D’Hollander, “High-Level Synthesis Optimization for Blocked Floating-
Point Matrix Multiplication”, ACM SIGARCH Computer Architecture News, vol. 44,
no. 4, pp. 74–79, 2017, ISSN: 01635964. DOI: 10.1145/3039902.3039916.
[26] Avnet, Zynq Evaluation and Development Board: Hardware User’s Guide v2.2, 2014.
[Online]. Available: http://zedboard.org/sites/default/files/documentations/
ZedBoard_HW_UG_v2_2.pdf.
[27] Xilinx, Embedded vision solutions powered by xilinx, 2019. [Online]. Available:
https://www.xilinx.com/applications/megatrends/video-vision.html.
[28] E. Billauer, High resolution images of the zedboard, 2012. [Online]. Available: http:
//billauer.co.il/blog/2012/09/zedboard-zynq-images/.
[29] Xilinx, “Vivado Design Suite User Guide: Design Flows Overview v2018.2”,
in, 2018, ch. 1. [Online]. Available: https : / / www . xilinx . com / support /
documentation/sw_manuals/xilinx2018_2/ug892-vivado-design-flows-
overview.pdf.
[30] ——, “Vivado Design Suite User Guide: Embedded Processor Hardware De-
sign v2018.2”, in, 2018, ch. 3. [Online]. Available: https : / / www . xilinx .
com/support/documentation/sw_manuals/xilinx2018_2/ug898-vivado-
embedded-design.pdf.
[31] ——, “Vivado Design Suite User Guide: Programming and Debugging v2017.4”,
in, 2018, ch. 9-12. [Online]. Available: https://www.xilinx.com/support/
documentation/sw_manuals/xilinx2017_4/ug908-vivado-programming-
debugging.pdf.
[32] ——, Using Xilinx SDK, 2017. [Online]. Available: https://www.xilinx.com/
html_docs/xilinx2017_4/SDK_Doc/index.html.
[33] ——, “Zynq-7000 All Programmable SoC: Embedded Design Tutorial: A Hands-
On Guide to Effective Embedded System Design v2017.4”, in, 2017. [Online].
Available: https://www.xilinx.com/support/documentation/sw_manuals/
xilinx2017_4/ug1165-zynq-embedded-design-tutorial.pdf.
[34] R. N. Appel and H. H. Folmer, “Analysis, optimization, and design of a SLAM
solution for an implementation on reconfigurable hardware (FPGA) using CλaSH”,
PhD thesis, 2016. [Online]. Available: http://essay.utwente.nl/71550/.
[35] V. Bonato, E. Marques, and G. A. Constantinides, “A floating-point extended
Kalman filter implementation for autonomous mobile robots”, 2007 International Conference on Field Programmable Logic and Applications, 2007. DOI: 10.
1109/fpl.2007.4380720.
[36] W. Fang, Y. Zhang, B. Yu, and S. Liu, “FPGA-based ORB feature extraction for
real-time visual SLAM”, in 2017 International Conference on Field-Programmable Technology, ICFPT 2017, vol. 2018-January, 2018, pp. 275–278, ISBN: 9781538626559.
DOI: 10.1109/FPT.2017.8280159. arXiv: 1710.07312.
[37] Q. Gautier, A. Shearer, J. Matai, D. Richmond, P. Meng, and R. Kastner, “Real-
time 3D reconstruction for FPGAs: A case study for evaluating the perfor-
mance, area, and programmability trade-offs of the Altera OpenCL SDK”, in
Proceedings of the 2014 International Conference on Field-Programmable Technology, FPT 2014, 2015, pp. 326–329, ISBN: 9781479962457. DOI: 10.1109/FPT.2014.
7082810.
[38] M. Gu, K. Guo, W. Wang, Y. Wang, and H. Yang, “An FPGA-based Real-time
Simultaneous Localization and Mapping System”, no. 61373026, pp. 0–3, 2015.
[39] D. Törtei Tertei, J. Piat, and M. Devy, “FPGA design of EKF block accelerator
for 3D visual SLAM”, Computers and Electrical Engineering, vol. 55, pp. 1339–
1351, 2016, ISSN: 00457906. DOI: 10.1016/j.compeleceng.2016.05.003.
[40] B. W. Williams, J. Zambreno, and P. Jones, “Evaluation of a SoC for Real-Time
3D SLAM”, PhD thesis, 2017. [Online]. Available: https://lib.dr.iastate.
edu/etd.
[41] M. Abouzahir, A. Elouardi, S. Bouaziz, O. Hammami, and I. Ali, “High-level
synthesis for FPGA design based-SLAM application”, in Proceedings of IEEE/ACS International Conference on Computer Systems and Applications, AICCSA, 2017,
ISBN: 9781509043200. DOI: 10.1109/AICCSA.2016.7945638.
[42] K. Boikos and C. S. Bouganis, “Semi-dense SLAM on an FPGA SoC”, in FPL 2016 - 26th International Conference on Field-Programmable Logic and Applications,
2016, ISBN: 9782839918442. DOI: 10.1109/FPL.2016.7577365.
[43] ——, “A high-performance system-on-chip architecture for direct tracking for
SLAM”, in 2017 27th International Conference on Field Programmable Logic and Applications, FPL 2017, 2017, ISBN: 9789090304281. DOI: 10.23919/FPL.2017.
8056831.
[44] O. Wasenmüller and D. Stricker, “Comparison of kinect v1 and v2 depth im-
ages in terms of accuracy and precision”, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 10117 LNCS, pp. 34–45, 2017, ISSN: 16113349. DOI: 10.1007/
978-3-319-54427-4_3.
[45] A. Handa, T. Whelan, J. Mcdonald, and A. J. Davison, “A Benchmark for RGB-
D Visual Odometry, 3D Reconstruction and SLAM”, IEEE International Conference on Robotics and Automation (ICRA), 2014. DOI: 10.1109/ICRA.2014.
6907054.
[46] F. Durand and J. Dorsey, “Fast Bilateral Filtering for the Display of High-
Dynamic-Range Images”, ACM Trans. Graph. (Proc. SIGGRAPH), pp. 257–266,
2002.
[47] E. Eade, “Lie Groups for Computer Vision”, Website, pp. 1–15, 2014. [Online].
Available: http://ethaneade.com/lie_groups.pdf.
[48] L. Nardi, B. Bodin, M. Z. Zia, J. Mawer, A. Nisbet, P. H. Kelly, A. J. Davi-
son, M. Luján, M. F. O’Boyle, G. Riley, N. Topham, and S. Furber, Pamela-project/slambench, 2017. [Online]. Available: https://github.com/pamela-
project/slambench.
[49] B. Bodin, H. Wagstaff, S. Saeedi, L. Nardi, E. Vespa, J. H. Mayer, A. Nisbet,
M. Luján, S. Furber, A. J. Davison, P. H. J. Kelly, and M. O’Boyle, Pamela-project/slambench2, 2019. [Online]. Available: https://github.com/pamela-
project/slambench2.
[50] G. Reitmayr, Gerhardr/kfusion, 2013. [Online]. Available: https://github.com/
GerhardR/kfusion.
[51] E. Rosten, Toon: Tom’s object-oriented numerics library, 2018. [Online]. Available:
https://www.edwardrosten.com/cvd/toon.html.
[52] S. Zennaro, M. Munaro, S. Milani, P. Zanuttigh, A. Bernardi, S. Ghidoni, and E.
Menegatti, “Performance evaluation of the 1st and 2nd generation Kinect for
multimedia applications”, Proceedings - IEEE International Conference on Multimedia and Expo, vol. 2015-August, pp. 1–6, 2015, ISSN: 1945788X. DOI: 10.1109/
ICME.2015.7177380.
[53] M. D. McCool, “Structured Parallel Programming with Deterministic Patterns”,
Dr. Dobb's Journal, no. June 2010, pp. 7–12, 2010, ISSN: 0960-1317. DOI: 10.1088/
0960-1317/5/3/002.
[54] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. 2007, p. 423, ISBN: 9780123704900.
[55] M. Schmid, N. Apelt, F. Hannig, and J. Teich, “An image processing library for
C-based high-level synthesis”, Conference Digest - 24th International Conference on Field Programmable Logic and Applications, FPL 2014, 2014. DOI: 10.1109/
FPL.2014.6927424.
[56] J. Hegarty, J. Brunhaver, Z. DeVito, J. Ragan-Kelley, N. Cohen, S. Bell, A. Vasi-
lyev, M. Horowitz, and P. Hanrahan, “Darkroom: Compiling High-Level Im-
age Processing Code into Hardware Pipelines”, ACM Transactions on Graphics,
vol. 33, no. 4, pp. 1–11, 2014, ISSN: 07300301. DOI: 10.1145/2601097.2601174.
[Online]. Available: http://dl.acm.org/citation.cfm?doid=2601097.
2601174.
“Rigel: flexible multi-rate image processing hardware”, ACM Trans. Graph., vol. 35, no. 4, 85:1–85:11, 2016.
[58] O. Reiche, M. A. Ozkan, R. Membarth, J. Teich, and F. Hannig, “Generat-
ing FPGA-based image processing accelerators with Hipacc: (Invited paper)”,
IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD, vol. 2017-November, pp. 1026–1033, 2017, ISSN: 10923152. DOI:
10.1109/ICCAD.2017.8203894.
[59] Xilinx, Application Note: Zynq-7000 AP SoC: Demystifying the Lucas-Kanade Optical Flow Algorithm with Vivado HLS v1.0, 2017. [Online]. Available: https:
//www.xilinx.com/support/documentation/application_notes/xapp1300-
lucas-kanade-optical-flow.pdf.
[60] ——, Application Note: Vivado HLS: Implementing Memory Structures for Video Processing in the Vivado HLS Tool v1.0, 2012. [Online]. Available: https://www.
xilinx.com/support/documentation/application_notes/xapp793-memory-
structures-video-vivado-hls.pdf.
[61] ——, 7 Series FPGAs Memory Resources: User Guide v1.13, 2019. [Online]. Avail-
able: https://www.xilinx.com/support/documentation/user_guides/
ug473_7Series_Memory_Resources.pdf.
[62] B. Da Silva, A. Braeken, E. H. D’Hollander, and A. Touhafi, “Performance and
resource modeling for FPGAs using high-level synthesis tools”, Advances in Parallel Computing, vol. 25, pp. 523–531, 2014, ISSN: 09275452. DOI: 10.3233/
978-1-61499-381-0-523.
[63] Xilinx, CORDIC v6.0: LogiCORE IP Product Guide, 2017. [Online]. Available:
https://www.xilinx.com/support/documentation/ip_documentation/
cordic/v6_0/pg105-cordic.pdf.
[64] ——, Reduce Power and Cost by Converting from Floating Point to Fixed Point v1.0,
2017. [Online]. Available: http://xilinx.eetrend.com/files- eetrend-
xilinx/download/201706/11535-30442-wp491-floating-fixed-point.pdf.
[65] L. Saldanha and R. Lysecky, “Float-to-fixed and fixed-to-float hardware con-
verters for rapid hardware/software partitioning of floating point software
applications to static and dynamic fixed point coprocessors”, Design Automation for Embedded Systems, vol. 13, no. 3, pp. 139–157, 2009, ISSN: 09295585. DOI:
10.1007/s10617-009-9044-4.
[66] L. Yang, L. Zhang, H. Dong, A. Alelaiwi, and A. E. Saddik, “Evaluating and
improving the depth accuracy of Kinect for Windows v2”, IEEE Sensors Journal, vol. 15, no. 8, pp. 4275–4285, 2015, ISSN: 1530437X. DOI: 10.1109/JSEN.2015.
2416651.
[67] C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images”,
1998, pp. 839–846.
[68] Xilinx, Xilinx OpenCV User Guide, 2017. [Online]. Available: https://www.
xilinx.com/support/documentation/sw_manuals/xilinx2017_4/ug1233-
xilinx-opencv-user-guide.pdf.
[69] R. Smeenk, Kinect v1 and kinect v2 fields of view compared, 2014. [Online]. Avail-
able: https://smeenk.com/kinect-field-of-view-comparison/.
[70] J. Lee, T. Ueno, M. Sato, and K. Sano, “High-productivity Programming and
Optimization Framework for Stream Processing on FPGA”, HEART 2018 Proceedings of the 9th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies, pp. 1–6, 2018. DOI: 10.1145/3241793.3241798.
[71] P. Zhang, M. Huang, B. Xiao, H. Huang, and J Cong, “CMOST: A system-level
FPGA compilation framework”, Design Automation Conference (DAC), 2015 52nd ACM/EDAC/IEEE, 2015, ISSN: 03649059. DOI: 10.1145/2744769.2744807.
[72] Xilinx, Vivado HLS Optimization Methodology Guide v2018.1, 2018. [Online]. Avail-
able: https : / / www . xilinx . com / support / documentation / sw _ manuals /
xilinx2018_1/ug1270-vivado-hls-opt-methodology-guide.pdf.
[73] W. MacLean, “An Evaluation of the Suitability of FPGAs for Embedded Vi-
sion Systems”, in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - Workshops, IEEE, 2006, pp. 131–131. DOI:
10.1109/cvpr.2005.408.