Brigham Young University
BYU ScholarsArchive
Theses and Dissertations
2021-06-16
High-Speed Image Classification for Resource-Limited Systems Using Binary Values
Taylor Scott Simons, Brigham Young University
Follow this and additional works at: https://scholarsarchive.byu.edu/etd
Part of the Engineering Commons
BYU ScholarsArchive Citation
Simons, Taylor Scott, "High-Speed Image Classification for Resource-Limited Systems Using Binary Values" (2021). Theses and Dissertations. 9097. https://scholarsarchive.byu.edu/etd/9097
This Dissertation is brought to you for free and open access by BYU ScholarsArchive. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of BYU ScholarsArchive. For more information, please contact [email protected].
High-Speed Image Classification
For Resource-Limited Systems
Using Binary Values
Taylor Scott Simons
A dissertation submitted to the faculty of Brigham Young University
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
Dah-Jye Lee, Chair
Randal Beard
Jeffery Goeders
David Long
Department of Electrical and Computer Engineering
Brigham Young University
Copyright © 2021 Taylor Scott Simons
All Rights Reserved
ABSTRACT
High-Speed Image Classification for Resource-Limited Systems
Using Binary Values
Taylor Scott Simons
Department of Electrical and Computer Engineering, BYU
Doctor of Philosophy
Image classification is a memory- and compute-intensive task. It is difficult to implement high-speed image classification algorithms on resource-limited systems like FPGAs and embedded computers. Most image classification algorithms require many fixed- and/or floating-point operations and values. In this work, we explore the use of binary values to reduce the memory and compute requirements of image classification algorithms. Our objective was to implement these algorithms on resource-limited systems while maintaining comparable accuracy and high speeds. By implementing high-speed image classification algorithms on resource-limited systems like embedded computers, FPGAs, and ASICs, automated visual inspection can be performed on small low-powered systems. Industries like manufacturing, medicine, and agriculture can benefit from compact, high-speed, low-power visual inspection systems. Tasks like defect detection in manufactured products and quality sorting of harvested produce can be performed cheaper and more quickly. In this work, we present ECO Jet Features, an algorithm adapted to use binary values for visual inspection. The ECO Jet Features algorithm ran 3.7× faster than the original ECO Features algorithm on embedded computers. It also allowed the algorithm to be implemented on an FPGA, achieving 78× speedup over full-sized desktop systems, using a fraction of the power and space. We reviewed Binarized Neural Nets (BNNs), neural networks that use binary values for weights and activations. These networks are particularly well suited for FPGA implementation and we compared and contrasted various FPGA implementations found throughout the literature. Finally, we combined the deep learning methods used in BNNs with the efficiency of Jet Features to make Neural Jet Features. Neural Jet Features are binarized convolutional layers that are learned through deep learning and learn classic computer vision kernels like the Gaussian and Sobel kernels. These kernels are efficiently computed as a group and their outputs can be reused when forming output channels. They performed just as well as BNN convolutions on visual inspection tasks and are more stable when trained on small models.
Keywords: image classification, computer vision, FPGA, embedded systems, neural networks
ACKNOWLEDGMENTS
I would like to thank and acknowledge my advisor, Dr. Lee. He has always believed in and
supported me throughout this process. Through his example, he has helped me become a better
student, researcher, and person. I would like to thank my wife and children for filling my life with
joy. Taylor, Joan, and Harvey have been a foundation of comfort and reassurance that I have relied
on many times throughout graduate school and life. I also thank my parents for the many years
they have devoted to raising me, teaching me, and caring for me. I would like to thank all my
friends and classmates that helped me along the way. I would like to thank (and apologize to) all
my professors here at BYU that have inspired me and put up with me, especially when I thought I
was smarter than I actually was.
TABLE OF CONTENTS
TITLE PAGE
ABSTRACT
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES

Chapter 1  Introduction
    1.1  Image Classification
        1.1.1  Traditional Image Classification
        1.1.2  Convolutional Neural Networks
    1.2  Overview
        1.2.1  ECO Jet Features
        1.2.2  Binarized Neural Networks
        1.2.3  Scaling Down Binarized Neural Networks
        1.2.4  Neural Jet Features
    1.3  Contributions

Chapter 2  Jet Features: Hardware-Friendly, Learned Convolutional Kernels for High-Speed Image Classification
    2.1  Introduction
    2.2  Jet Features
        2.2.1  Introduction to Jet Features
        2.2.2  Multiscale Local Jets
    2.3  The ECO Features Algorithm
    2.4  The ECO Jet Features Algorithm
        2.4.1  Jet Feature Selection
        2.4.2  Advantages in Software
        2.4.3  Advantages in Hardware
    2.5  Hardware Architecture
        2.5.1  The Jet Features Unit
        2.5.2  The Random Forest Unit
    2.6  Results
        2.6.1  Datasets
        2.6.2  Accuracy on MNIST and CIFAR-10
        2.6.3  Accuracy on BYU Fish dataset
        2.6.4  Software Speed Comparison
        2.6.5  Hardware Implementation Results
    2.7  Smart Camera System
        2.7.1  System Overview
        2.7.2  Software
    2.8  Conclusion

Chapter 3  A Review of Binarized Neural Networks
    3.1  Introduction
    3.2  Terminology
    3.3  Background
        3.3.1  Network Quantization Techniques
        3.3.2  Early Binarization
    3.4  An Introduction to BNNs
        3.4.1  Binarization of Weights
        3.4.2  Binarization of Activations
        3.4.3  Bitwise Operations
        3.4.4  Batch Normalization
        3.4.5  Accuracy
        3.4.6  Robustness to Attacks
    3.5  Major BNN Developments
        3.5.1  The Original BNN
        3.5.2  XNOR-Net
        3.5.3  DoReFa-Net
        3.5.4  Tang et al.
        3.5.5  ABC-Net
        3.5.6  BNN+
        3.5.7  Comparison
    3.6  Improving BNNs
        3.6.1  Scaling with a Gain Term
        3.6.2  Using Multiple Bases
        3.6.3  Partial Binarization
        3.6.4  Learning Rate
        3.6.5  Padding
        3.6.6  More Binarization
        3.6.7  Batch Normalization and Activations as a Threshold
        3.6.8  Layer Order
    3.7  Comparison of Accuracy
        3.7.1  Datasets
        3.7.2  Topologies
        3.7.3  Table of Comparisons
    3.8  Hardware Implementations
        3.8.1  FPGA Implementations
        3.8.2  Architectures
        3.8.3  High-Level Synthesis
        3.8.4  Comparison of FPGA Implementations
        3.8.5  ASICs
    3.9  Conclusions

Chapter 4  Using Full Precision Methods to Scale Down Binarized Neural Networks
    4.1  Depthwise Separable Convolutions
        4.1.1  Experiments and Results
        4.1.2  Discussion
    4.2  Direct Skip Connections in BNNs
        4.2.1  Experiments and Results
        4.2.2  Discussion

Chapter 5  Neural Jet Features
    5.1  Introduction
    5.2  Neural Jet Features
        5.2.1  Constrained Neural Jet Features
        5.2.2  Jet Features
        5.2.3  Computational Efficiency
    5.3  Results
        5.3.1  Model Topology
        5.3.2  BYU Fish Dataset
        5.3.3  BYU Cookie Dataset
        5.3.4  MNIST Dataset
    5.4  Conclusion

Chapter 6  Conclusion
    6.1  Discussion
    6.2  Summary of Contributions
    6.3  Future Work

REFERENCES
LIST OF TABLES
2.1  ECO Feature Transforms
2.2  ECO Jet Feature FPGA Resource Usage
2.3  ECO Feature FPGA Resource Usage Per Unit
3.1  XNOR Operation Equivalence
3.2  Accuracy of BNNs on the CIFAR-10 Dataset
3.3  Accuracy of BNNs on the ImageNet Dataset
3.4  Overview of BNN Models and Methods
3.5  Accuracy of BNNs on the MNIST Dataset
3.6  Accuracy of Non-binary Models on the MNIST Dataset
3.7  Accuracy of BNNs on the SVHN Dataset
3.8  Accuracy of Non-binary Models on the SVHN Dataset
3.9  Accuracy of BNNs on the CIFAR-10 Dataset
3.10 Accuracy of Non-binary Models on the CIFAR-10 Dataset
3.11 Accuracy of BNNs on the ImageNet Dataset
3.12 Comparison of BNNs Implemented on FPGAs
5.1  Layer Sizes for Neural Jet Feature Models
LIST OF FIGURES
1.1  Computational and Memory Cost of Image Classification Systems
2.1  Example of a Separable Filter
2.2  Jet Feature Kernels
2.3  Jet Feature Examples
2.4  Example of ECO Feature Transforms
2.5  ECO Feature Mutation and Crossover
2.6  ECO Feature Example
2.7  The Original Features Algorithm
2.8  ECO Jet Mutation and Crossover
2.9  ECO Jet Architecture
2.10 Single-Buffer Convolution Units
2.11 ECO Jet Accuracy vs Jet Order
2.12 ECO Jet Features Accuracy vs Node Count
2.13 Examples from the MNIST and CIFAR-10 datasets
2.14 Examples from the BYU Fish dataset
2.15 ECO Features Accuracy on the CIFAR-10 Dataset
2.16 ECO Features Accuracy on the MNIST Dataset
2.17 ECO Jet Accuracy on the BYU Fish Dataset
2.18 Smart Camera System
3.1  The Sign Layer/Straight-Through Estimator
3.2  Topology of the original Binarized Neural Networks
4.1  Standard Convolution Filters
4.2  Depthwise Separable Convolution Filters
4.3  Depthwise Separable Convolution Topology
4.4  Accuracy of Depthwise Separable BNNs on the CIFAR-10 Dataset
4.5  Direct Skip Connections
4.6  Accuracy of Direct Skip Connections on the CIFAR-10 Dataset
5.1  Jet Feature Building Blocks
5.2  The "Constrained" Neural Jet Feature Kernels
5.3  Neural Jet Feature Operations
5.4  Neural Jet Feature Topology
5.5  Neural Jet Feature Accuracy on the BYU Fish Dataset
5.6  The BYU Cookie Dataset
5.7  Neural Jet Feature Accuracy on the BYU Cookie Dataset
5.8  Neural Jet Feature Accuracy on the MNIST Dataset - 16 Filter and 32 Dense Units
5.9  Neural Jet Feature Accuracy on the MNIST Dataset - 32 Filter and 128 Dense Units
5.10 Neural Jet Feature Accuracy on the MNIST Dataset - 64 Filter and 256 Dense Units
CHAPTER 1. INTRODUCTION
Image classification is one of the most popular computer vision tasks. Many industries, like agriculture, medicine, and manufacturing, are adopting automatic image classification computer
systems to perform visual inspection. In the past, these industries have relied on humans to sort
objects and images. By employing computer systems to sort images, visual inspection can be
performed faster, cheaper, and in environments where human visual inspection cannot be used.
Image classification is usually computationally expensive and requires large amounts of memory.
In this work, we explore ways in which binary values and bitwise operations can be used in place
of fixed- and floating-point values and operations to reduce the computational costs and memory
requirements of image classification systems.
We targeted embedded computer systems and Field Programmable Gate Arrays (FPGAs)
as low-powered, compact platforms for high-speed image classification. These small form-factor
platforms can be installed in locations where bulky GPU/CPU systems cannot. They achieve high
processing speeds and require less power, but lack the abundance of memory and computational
resources that are afforded by full-sized GPU/CPU systems. By using binary digits, only a single bit is required to represent each value, either +1 or −1. These binary values replace higher precision floating- and fixed-point values. This greatly simplifies the algorithm's arithmetic operations and reduces the size of the classification model. Simplified arithmetic and bitwise operations
are especially well suited for FPGAs, which manipulate signals at the bit level. These operations
are also well suited for embedded CPUs where large amounts of floating-point calculations prove
cumbersome.
This work explores the use of binary values in both traditional and modern computer vision techniques, combining them to take advantage of the elegance of traditional systems and the
convenience and power of modern Deep Learning (DL) methods.
1.1 Image Classification
Image classification algorithms begin by processing an image pixel by pixel. From these
pixel level computations, higher levels of abstractions are computed. Algorithms usually require
multiple levels of abstraction to be processed before an input image can be classified, which can
require large amounts of memory. Figure 1.1 illustrates how pixel and local level processing is
computationally expensive and the global classification towards the end of the model is memory
intensive. This applies to both traditional and DL algorithms. Traditional image classification
systems have two distinct parts, a computationally intensive image processing front end and a
memory-intensive classification model on the back end. DL models have a single model that
transitions from low level computations to memory-intensive fully connected layers at the backend.
Both of these concepts are shown in Figure 1.1.
Figure 1.1: Image classification tasks generally require large amounts of memory and computational resources. In general, the first part of image classification algorithms is computationally intensive and the latter parts are memory intensive. Traditional image classification algorithms split these two parts into distinct processes with image processing followed by a classification model. Deep Learning models blend these parts with computationally-intensive early convolutional layers followed by memory-intensive fully connected layers.
Many mainstream models target general image classification tasks with thousands of possible classes to choose from [1]. In this work, we focus on visual inspection tasks, rather than
general classification. Visual inspection tasks only need to cover a few different classes while
general classification can classify images into thousands of different classes. Automatic visual
inspection, used in fields like manufacturing and agriculture, requires high-speed image classification with fixed camera conditions and a limited number of possible image classifications. Our
work is directed towards these types of applications where speed, power, and size are important
factors while the images being classified have less variance than the diverse set of images that are
often found in most state-of-the-art classification datasets.
We combined the simplicity of traditional image classification algorithms with the convenience and power of Machine Learning (ML) and Convolutional Neural Networks (CNN). We
targeted an intersection between these two paradigms through the use of binary values and bitwise
operations instead of floating-point or fixed-point values and operations.
1.1.1 Traditional Image Classification
Traditionally, image classification algorithms begin by using image processing and computer vision to extract key features from an image. These features are fed into ML models that
organize and make sense of these features. Feature descriptors such as SIFT [2], SURF [3],
and ORB [4] have been used in conjunction with support vector machines (SVMs) [5], decision
trees [6], and bag of visual words (BoVWs) models [7].
Discrete convolutions between input images and filter kernels are at the heart of the image
processing used in image classification. Convolution kernels are used for a variety of tasks, such
as determining the scale and orientation of image features and reducing noise in an image. Convolution operations, which require many multiplication and addition operations, usually comprise
the majority of the computations within the algorithm.
ML models, used to perform the actual classification, are usually trained from a large set
of example images. These models either require that examples are stored away for reference [8] or
they aggregate the examples together to form a model that can generalize the learned information
well [9]. The classification models generally require the majority of the memory used by the
algorithm.
1.1.2 Convolutional Neural Networks
While traditional image classification pipelines separate image processing and classification models into distinct segments, convolutional neural networks (CNNs) use a single unified
model that performs both tasks [10]. In CNNs, both the classification model and convolution kernels are learned simultaneously during training. CNNs are the current state-of-the-art for image
classification tasks and have achieved much higher accuracies than traditional methods on most
tasks.
LeCun et al. first proposed CNNs in 1998 with the LeNet model [10] which was capable of
classifying small images of handwritten single digit numbers. It wasn’t until 2012 that CNNs began
to be widely accepted as the state-of-the-art when the AlexNet model [11] won the ImageNet competition [1] which involved 1000 different possible classifications of large photographed images.
Since then, major improvements have been made to CNNs. AlexNet has since been surpassed by
other CNN models like VGG [12], ResNet [13], DenseNet [14] and ResNext [15].
Most CNNs for classification are composed of a series of convolutional layers followed by
fully connected layers. These convolutional layers are composed of many more convolution operations than traditional image processing techniques, and like the image processing in traditional
algorithms, these convolutions require the majority of operations in the model. Fully connected
layers consist of large matrix multiplications. These matrices require the majority of the model’s
memory. The convolution kernels and matrices used in the fully connected layers are both learned
by the deep learning algorithm during training. CNNs are known to require more operations and
memory than traditional image classification methods.
1.2 Overview
1.2.1 ECO Jet Features
Chapter 2 details our efforts to use binary values to reduce the computational cost of an
image classification system targeted at a traditional image classification algorithm, the Evolution
Constructed Features (ECO Features) [16] algorithm. This allowed the algorithm to be easily
implemented on an FPGA for low-power, high-speed image classification. We also showed that our
binary convolutions significantly improved the algorithm’s speed on embedded computer systems
and FPGAs, while maintaining classification accuracy.
The ECO Features algorithm uses a traditional image classification pipeline, beginning
with image processing operations followed by ML classification. ECO Features use a genetic
algorithm to construct image processing operations that are then fed into an ML model. The
image processing operations used by this algorithm are not always FPGA friendly nor conducive
to resource-constrained systems. In Chapter 2 we point out that the operations most commonly
selected by the genetic algorithm were convolution filters. We broke these commonly used filters
into a series of smaller convolutions that only used binary weights. We call the filters constructed
this way Jet features since they build up a set of scaled partial derivatives similar to n-jet sets [17].
These Jet features include or approximate many popular image filters such as the Gaussian filter,
Sobel filter, and Laplacian filter and can all be calculated simultaneously.
Jet features do not require any multiplication and can be easily implemented to run in
parallel with each other. With Jet features, the algorithm experiences a 3.7× speedup on CPUs
while maintaining the same level of accuracy. In addition, the operations in the algorithm became
simple enough to easily implement in an FPGA, which is not feasible using the original algorithm.
1.2.2 Binarized Neural Networks
In Chapter 3 we review Binarized Neural Networks (BNNs). BNNs are Neural Networks
that use binary values for model weights and neural activations. We outlined the major developments that have been made to BNNs. We reviewed and summarized the various techniques used to
improve their accuracy. We compared proposed BNN implementations on FPGAs and FPGA/CPU
co-processing systems in terms of accuracy, speed, power, and resources used.
Normally, it is not possible to perform standard training using binary values in CNNs. The
backpropagation method requires the use of continuous values and functions. Initial efforts to
use binary values in CNNs used pre-trained full-precision models, then approximated these full
precision values using binary ones [18]. Chapter 3 outlines the basic methods that allow Neural
Networks to be trained using binary values. We summarized the major developments that have
been made throughout the literature to make BNNs more effective. We also reviewed the various
efforts to implement BNNs on FPGA and FPGA/CPU hybrid systems.
1.2.3 Scaling Down Binarized Neural Networks
Most implementations of BNNs throughout the literature either require large FPGAs or
FPGA/CPU hybrid systems. Chapter 4 looks at methods that were originally designed for full
precision neural networks and applies them to BNNs in an attempt to reduce their resource requirements on FPGAs. Our experiments showed that these methods are not particularly well suited
for BNNs. They reduced FPGA resource requirements but also hurt their
classification accuracy. These results motivated us to find techniques specifically tailored to BNNs
in order to reduce their size, which we present in Chapter 5.
1.2.4 Neural Jet Features
Chapter 5 introduces Neural Jet Features. Most BNN developments of the last few years
focused on making BNNs more accurate on large, complex datasets, like the ImageNet dataset [1].
These BNNs tended to be large and increasingly expensive. Our goal in this dissertation was to use
binary values to perform high-speed image classification with limited computational resources, like
the ECO Jet Features algorithm. In order to accomplish this, we developed the Neural Jet Features
layer. This convolutional layer can be used to replace the standard BNN convolutional layer with
Jet Features that are learned through deep learning.
The Neural Jet Features layer learns classic computer vision kernels but is trained as a
single end-to-end system. The image features are trained at the same time as the fully connected
classifier, like BNNs, which is not possible in the ECO Features algorithm. This convolutional
layer requires fewer operations and weights than the standard BNN convolutional layer but maintains similar accuracy on visual inspection tasks.
1.3 Contributions
This dissertation focuses on reducing the computational and memory costs of image classification systems through the use of binary values. We developed Jet Features, a novel set of
convolutional kernels that are constructed from smaller binary-valued convolution operations. We
applied Jet Features to the existing ECO Features algorithm which reduced its computational and
memory costs and allowed it to be implemented on an FPGA while maintaining comparable classification accuracy.
BNNs are not new to this work, but we reviewed the existing literature with a special focus
on FPGA implementations. We highlighted the major developments in the field of BNNs and compared the various implementations. We compiled a list of the various FPGA implementations, the
topologies they used, the techniques they employed, the platforms they targeted, and the resources
they required.
Neural Jet Features are unique to this work. They are a combination of our prior efforts
introduced in this work. They combine BNN techniques with Jet Features and allow BNNs to be
implemented in smaller FPGAs without sacrificing accuracy.
CHAPTER 2. JET FEATURES: HARDWARE-FRIENDLY, LEARNED CONVOLUTIONAL KERNELS FOR HIGH-SPEED IMAGE CLASSIFICATION
In this chapter, we present a set of learned convolutional kernels which we call Jet Features. Jet Features are convolutional kernels that are formed from a series of convolutions of small
binary-valued convolution kernels. These small binary-valued kernels make Jet Features efficient
to compute in software, easy to implement in hardware, and perform well on visual inspection
tasks. Because Jet Features can be learned, they can be used in machine learning algorithms.
Using Jet Features we make significant improvements on previous work by Lillywhite et al., the
Evolution Constructed Features (ECO Features) algorithm [19]. We gained a 3.7x speedup in software without losing any accuracy on the CIFAR-10 and MNIST datasets. Jet Features also allowed
us to implement the algorithm in a Field Programmable Gate Array (FPGA) using only a fraction
of its resources.
2.1 Introduction
The field of computer vision has come a long way in solving the problem of image classification. Not too long ago, handcrafted convolutional kernels were a staple of all computer vision
algorithms. With the advent of Convolutional Neural Networks (CNNs), handcrafted features have become the exception rather than the rule, and for good reason. CNNs have taken the field of computer vision to new heights by solving problems that used to be unapproachable or unthinkable.
With deep learning, convolutional kernels can be learned from patterns seen in the data rather than
pre-constructed by algorithm designers.
While CNNs are the most accurate solution to many computer vision tasks, they require
many parameters and many calculations to achieve such accuracy. In this work, we seek to speed
up image classification on simple tasks by leveraging some of the mathematical properties found
in classic handcrafted kernels and applying them in a procedural way with machine learning.
Convolutions with Jet Features are efficient to compute in both hardware and software.
They take advantage of redundant calculations during convolution operations and use only the
simplest kernels. We applied these features to our previous machine learning image classification
algorithm, the Evolution Constructed Features (ECO Features) algorithm. We call this new version
of the algorithm, the Evolution Constructed Jet Features (ECO Jet Features) algorithm. It is accurate on visual inspection tasks and can be efficiently run on embedded computer devices without
the need for GPU acceleration. We specifically designed Jet Features to allow the algorithm to be
implemented on an FPGA, which will be referred to as our hardware implementation.
We tested a software implementation and a hardware implementation of our algorithm to
show the speed and compactness of the algorithm. Our hardware design is fully pipelined and
gives visual inspection results as soon as the image reaches the end of the data pipeline. This
hardware architecture was implemented on a Field Programmable Gate Array (FPGA), but could
be integrated into a system on a chip in custom silicon, where it could perform at an even faster
rate while using less power.
The original ECO Features algorithm [19] [16] has been used in various industrial applications. Many of these applications require high-speed visual inspection, where speed is important
but the identification task is fairly simple. In this work, we speed up the ECO Features algorithm,
allowing it to run 3.7 times faster in a software implementation while maintaining the same level
of accuracy on the MNIST and CIFAR-10 datasets. These improvements also made the algorithm
suitable for full parallelization and pipelining in hardware, which runs over 70 times faster in an
FPGA. The key innovations we present here are the development and use of Jet Features and the
development of a hardware architecture for our design.
2.2 Jet Features
Jet Features are convolutional kernels with special structures that allow for efficient convolutions. They are meaningful features in the visual domain and allow for elegant hardware
implementation. In fact, some of the most popular classical handcrafted convolutional kernels
qualify as Jet Features, like the Gaussian, Sobel, and Laplacian kernels. However, Jet Features are
not handcrafted, they are learned features that leverage some of the intuition behind these classic
kernels. Mathematically, they are related to multiscale local jets [17], which are reviewed in Section
2.2.2, but we introduce them here in a more informal manner.
2.2.1 Introduction to Jet Features
Jet Features are convolutional kernels that can be separated into a series of very small
kernels. In general, separable kernels are kernels that perform the same operation as a series of
convolutions with smaller kernels. Figure 2.1 shows an example of a 3x3 convolutional kernel that
can be separated into a series of convolutions with a 3x1 kernel and a 1x3 kernel. Jet Features take
separability to an extreme, being separated into the smallest meaningful sized kernels with only 2
elements. Specifically, all Jet Features can be separated into a series of convolutions with kernels
from the set {[1,1], [1,1]^T, [1,−1], and [1,−1]^T}, which are also shown in Figure 2.2. We refer to these small kernels as the Jet Feature building blocks. Two of these kernels, [1,1] and [1,1]^T, can be seen as blurring factors or scaling factors. We will refer to them as scaling factors. The other two kernels, [1,−1] and [1,−1]^T, apply a difference between pixels in either the x or y direction
and can be viewed as simple partial derivative operators. We will refer to them as partial derivative
operators. All Jet Features are a series of convolutions with any number of these basic building
blocks. With these building blocks, some of the most popular classic filters can be constructed.
In Figure 2.3 we show how the Gaussian and Sobel filters can be broken down into Jet Feature
building blocks.
Figure 2.1: An example of a separable filter. A 3x3 Gaussian kernel can be separated into two convolutions with smaller kernels.
It is important to note that the convolution operation is commutative and the order in which
the Jet Feature building blocks are applied does not matter. Therefore, every Jet Feature is defined
by the number of each building block it uses. For example, the 3x3 x-direction Sobel kernel can
be defined as 1 x-direction and 2 y-direction scaling factors and 1 x-direction partial derivative operator (see Figure 2.3).

Figure 2.2: The four basic kernels of all Jet Features. The top two kernels can be thought of as scaling or blurring factors. The bottom two perform derivatives in either the x- or y-direction. Every Jet Feature is simply a series of convolutions with any number of each of these kernels. The order does not matter.

Figure 2.3: These examples demonstrate how the Gaussian kernel and Sobel kernels are examples of Jet Features. These 3x3 kernels can be broken down into a series of four convolutions with the two-cell Jet Feature kernels. The Sobel kernels are similar to the Gaussian, but one of the scaling factors is replaced with a partial derivative.
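To make the construction concrete, the short sketch below (our illustration, not code from the dissertation) composes the four building-block kernels into the 3x3 Gaussian and Sobel kernels of Figure 2.3 using NumPy and SciPy; the helper name `compose` is ours.

```python
import numpy as np
from scipy.signal import convolve2d

# The four Jet Feature building blocks: scaling (blurring) factors and partial derivatives.
SCALE_X = np.array([[1, 1]])     # [1, 1]   : blur along x
SCALE_Y = np.array([[1], [1]])   # [1, 1]^T : blur along y
DERIV_X = np.array([[1, -1]])    # [1, -1]  : difference along x
DERIV_Y = np.array([[1], [-1]])  # [1, -1]^T: difference along y

def compose(blocks):
    """Convolve a list of building blocks into a single full-sized kernel."""
    kernel = np.array([[1]])
    for block in blocks:
        kernel = convolve2d(kernel, block)
    return kernel

# 3x3 Gaussian (unnormalized): two scaling factors in each direction.
gaussian = compose([SCALE_X, SCALE_X, SCALE_Y, SCALE_Y])
# 3x3 x-direction Sobel: one x scaling, two y scalings, one x partial derivative.
sobel_x = compose([SCALE_X, SCALE_Y, SCALE_Y, DERIV_X])

print(gaussian)  # [[1 2 1] [2 4 2] [1 2 1]]
print(sobel_x)   # [[ 1 0 -1] [ 2 0 -2] [ 1 0 -1]]
```

Any other Jet Feature is obtained the same way by choosing how many of each building block to include.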
2.2.2 Multiscale Local Jets
We can more formally define a jet feature as an image transform that is selected from a
multiscale local jet. All features for the algorithm are selected from the same multiscale local jet.
Multiscale local jets were proposed by Florack et al. [17] as useful image representations that could
capture both the scale and spatial information within an image. They have proven to be useful for
various computer vision tasks such as feature matching, feature tracking, image classification, and
image compression [20] [21] [22]. Manzanera constructed a single unified system for several of
these tasks using multiscale local jets and attributed its effectiveness to the fact that many other
features are implicitly contained within a multiscale local jet [22]. Some of these popular features
include the Gaussian blur, the Sobel operator, and the Laplacian filter.
Multiscale local jets are a set of partial derivatives of a scale space of a function. Members
of a multiscale local jet have been previously defined in [17] [20] and [22] as
$$L_{x^m y^n \sigma}(A) = A * \delta_{x^m y^n} G_\sigma, \tag{2.1}$$

where $A$ is an input image, $\delta_{x^m y^n}$ is a differential operator of degree $m$ with respect to $x$ and degree $n$ with respect to $y$, and $G_\sigma$ is the Gaussian operator with a variance of $\sigma$. A multiscale local jet is the set of outputs $L_{x^m y^n \sigma}(A)$ for a given range of values of $m$, $n$, and $\sigma$:

$$\Lambda_{x^a y^b [\sigma_c, \sigma_d]}(A) = \underbrace{\{\, L_{x^0 y^0 \sigma_c}(A),\ \ldots,\ L_{x^a y^b \sigma_d}(A) \,\}}_{\text{for all combinations in between}} \tag{2.2}$$
2.3 The ECO Features Algorithm
The ECO Features algorithm was originally developed in [19] and [16]. Its main purpose
was to automatically construct good image features that could be used for classification. This eliminated the need for human experts to handcraft features for specific applications. This algorithm was
developed as CNNs were gaining popularity, which solved similar problems [11]. We recognize
that CNNs are able to achieve better accuracy than the ECO Features algorithm in most tasks,
but ECO Features are smaller and generally less computationally expensive. In this chapter we
are interested in the effectiveness of Jet Features in the ECO Features algorithm. The impact of
Jet Features is fairly straightforward to explore when working with the ECO Features algorithm.
Exploration of Jet Features in CNNs is left for future work.
An ECO Feature is a series of image transforms performed back to back on an input image.
Figure 2.4 shows an example of a hypothetical ECO Feature. Each transform in the feature can
have a number of parameters that change the effects of the transform. The algorithm starts with
a predetermined pool of transforms which are selected by the user. Table 2.1 shows the pool of
transforms used in [16].
Figure 2.4: An example of a hypothetical ECO Feature made up of three transforms. The top boxes represent the type of each transform. The boxes below show each transform's associated parameters. The number of transforms, transform types, and parameters of each transform are randomly initialized and then evolved through a genetic algorithm.
The genetic algorithm initially forms ECO Features by selecting a random series of transforms and randomly setting each of their parameters. The parameters of each transform are modified through the process of mutation in the genetic algorithm. New orderings of the transforms are
also created as pairs of ECO Features are joined together through genetic crossover, where the first
part of one series is spliced with the latter portion of a different series. A graphical representation
of mutation and crossover is shown in Figure 2.5.
Each ECO Feature is paired with a classifier. An example is given in Figure 2.6. Originally,
single perceptrons were used as the classifiers for each ECO Feature. Since perceptrons are only
capable of binary classification, we seek to extend the capabilities of the algorithm to perform
multiclass classification. We replaced the perceptrons with random forest [23] classifiers in this
work. Inputs are fed through the ECO Feature transforms and the outputs are fed into the classifier.
A holdout set of images is then used to evaluate the accuracy of each ECO Feature. This accuracy is
Table 2.1: The pool of possible image transforms used in the ECO Features Algorithm.
Transform          Parameters    Transform                  Parameters
Gabor filter       6             Sobel operator             4
Gradient           1             Difference of Gaussians    2
Square root        0             Morphological erode        1
Gaussian blur      1             Adaptive thresholding      3
Histogram          1             Hough lines                2
Hough circles      2             Fourier transform          1
Normalize          3             Histogram equalization     0
Log                0             Laplacian Edge             1
Median blur        1             Distance transform         2
Integral image     1             Morphological dilate       1
Canny edge         4             Harris corner strength     3
Rank transform     0             Census transform           0
Resize             1             Pixel statistics           2
Figure 2.5: An example of mutation (top) and crossover (bottom). Mutation will only change the parameters of a given ECO Feature. Crossover takes the first part of one feature and appends the latter part of another feature to it.
used as a fitness score when performing genetic selection in the genetic algorithm. ECO Features
with high fitness scores are propagated to future rounds of evolution while weak ECO Features die
off. The genetic algorithm continues until a single ECO Feature outperforms all others for a set
number of consecutive generations. This ECO Feature is selected and saved while all others are
discarded. This process is repeated N times where N is the number of desired ECO Features.
Figure 2.6: An example pairing of an ECO Feature with a random forest classifier. Every ECO Feature is paired with its own classifier. Originally, perceptrons were used, but in our work, random forests are used which offer multiclass classification.
As the genetic algorithm selects ECO Features, they are combined to form an ensemble
using a boosting algorithm. We use the SAMME [24] variation of AdaBoost [25] for multiclass
classification. The boosting algorithm adjusts the weights of the dataset giving importance to
harder examples after each ECO Feature is created. This leads to ECO Features tailored to certain
aspects of the dataset. Once the desired number of ECO Features have been constructed, they are
combined into an ensemble. This ensemble predicts the class of new input images by passing the
image through all of the ECO Feature learners, letting each one vote for which class should be
predicted. Figure 2.7 depicts a complete ECO Features system.
Figure 2.7: A system diagram of the original ECO Features Algorithm. Each classifier has its own ECO Feature transform. The outputs of each classifier are collected into a weighted summation to determine the final prediction.
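As a rough sketch of this weighted-vote prediction step (our illustration; the function and variable names are hypothetical, and the weights are those produced by the boosting algorithm during training):

```python
import numpy as np

def predict(image, learners, weights, num_classes):
    """Weighted ensemble vote over trained ECO Feature learners.

    learners : list of (transform, classifier) pairs, where transform(image) applies an
               ECO Feature and classifier.predict(...) returns a class index
    weights  : per-learner voting weights produced by the boosting algorithm
    """
    scores = np.zeros(num_classes)
    for (transform, classifier), weight in zip(learners, weights):
        feature = transform(image)          # run the image through the ECO Feature
        vote = classifier.predict(feature)  # the paired classifier votes for a class
        scores[vote] += weight              # weighted vote
    return int(np.argmax(scores))           # class with the highest total score
```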
Since the publication of [19] and [16], the ECO Features algorithm has been applied to the problem of visual inspection, where it was used to determine the maturity of date fruits [26]. This
algorithm has also been used in industry to automate visual inspection for other processes.
2.4 The ECO Jet Features Algorithm
In this section, we look at how Jet Features can be introduced into the ECO Features algorithm. We call this modified version the ECO Jet Features algorithm. This modification sped
up performance while maintaining accuracy on simple image classification. It was specifically
designed to allow for easy implementation in hardware.
2.4.1 Jet Feature Selection
The ECO Jet Features algorithm uses a similar genetic algorithm to the one discussed in
Section 2.3. Instead of selecting image transforms from a pool and combining them into a series, it
simply uses a single Jet Feature. The numbers of scaling factors and partial derivatives are the parameters
that are tuned through evolution. These four parameters are bounded from 0 to a set maximum,
forming the multiscale local jet, similar to equation (2.2). We found that bounding the partial
derivatives, δx,δy ∈ [0,2], and scaling factors, σx,σy ∈ [0,6], is effective at finding good features.
In order to accommodate the use of jet features, mutation and crossover are redefined. The four parameters of the jet feature, δx, δy, σx, and σy, are treated like genes that make up the genome of the feature. During mutation, the values of these individual parameters are altered. During crossover, the genes of a child jet feature are each copied from either the father or the mother genome. This selection is made randomly. This is illustrated in Figure 2.8.
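A minimal sketch of this genome, mutation, and crossover (our illustration, using the bounds given above; the helper names are ours, not the dissertation's implementation):

```python
import random

# A Jet Feature genome: (dx, dy, sx, sy) = partial-derivative orders and scaling factors.
D_MAX, S_MAX = 2, 6  # bounds used during evolution: dx, dy in [0, 2]; sx, sy in [0, 6]

def random_genome():
    return [random.randint(0, D_MAX), random.randint(0, D_MAX),
            random.randint(0, S_MAX), random.randint(0, S_MAX)]

def mutate(genome):
    """Alter one randomly chosen gene, keeping it within its bound."""
    child = list(genome)
    i = random.randrange(4)
    child[i] = random.randint(0, D_MAX if i < 2 else S_MAX)
    return child

def crossover(mother, father):
    """Copy each gene of the child from either parent, chosen at random."""
    return [random.choice(pair) for pair in zip(mother, father)]
```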
2.4.2 Advantages in Software
The jet feature transformation can be calculated with a series of matrix shifts, additions,
and subtractions. Since the elements of the basis kernels for the transformations are either 1 or -1, there is no need for general convolution with matrix multiplication. Instead, a jet transform can be applied to image A by making a copy of A, shifting it in either the x or y direction by one pixel, and then adding it to or subtracting it from the original. Padding is not used. Using jet transforms,
Figure 2.8: How mutation (top) and crossover (bottom) are defined for Jet Features.
there is no need for multiplication or division operations. We do recognize that normalization is
normally used with traditional kernels; however, since this normalization is applied equally to all elements of an input image and the output values are fed into a classifier, we argue that the only
difference normalization makes is to keep the intermediate values of the image representation
reasonably small. In practice, we see no improvement in accuracy by normalizing during the jet
feature transform.
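A minimal NumPy sketch of this shift-and-add computation, following the no-padding, multiplication-free description above (our illustration; the function name and argument order are ours):

```python
import numpy as np

def jet_transform(A, dx, dy, sx, sy):
    """Apply a Jet Feature to image A using only shifts, additions, and subtractions.

    dx, dy : number of partial derivatives ([1,-1] kernels) in the x and y directions
    sx, sy : number of scaling factors ([1,1] kernels) in the x and y directions
    No padding is used, so the output shrinks by one row or column per building block.
    """
    out = A.astype(np.int32)
    for _ in range(sx):          # [1, 1] along x: add the copy shifted by one pixel
        out = out[:, 1:] + out[:, :-1]
    for _ in range(dx):          # [1, -1] along x: subtract the shifted copy
        out = out[:, 1:] - out[:, :-1]
    for _ in range(sy):          # [1, 1]^T along y
        out = out[1:, :] + out[:-1, :]
    for _ in range(dy):          # [1, -1]^T along y
        out = out[1:, :] - out[:-1, :]
    return out
```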
Another property of Jet Features that allows for efficient computation is the fact that one
Jet Feature of a higher order can be calculated using the result of a Jet Feature of a lower order.
The outputs of the lowest order jet features can be used as an input to any other ECO Jet Feature
that has parameters of equal or greater value. Calculating all of the jet features in an ensemble
in the shortest amount of time can be seen as an optimization problem where the order in which
features are calculated is optimized to require the minimum number of operations. We explored
optimization strategies that would find the most efficient order of buffers and arithmetic units for
a given ensemble of jet features. We did not see much improvement when employing complex
scheduling strategies. The most effective and simple strategy was calculating features with the
lowest sum of δx, δy, σx, and σy first and working up to higher-order features, reusing lower-order
outputs where we could.
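Sketched below is that simple strategy (our illustration, reusing the `jet_transform` sketch above; it is not the exact scheduler used in our implementation): features are computed in order of increasing δx + δy + σx + σy, and each result is cached so that higher-order features start from the closest lower-order output.

```python
def compute_ensemble(A, genomes):
    """Compute every Jet Feature in an ensemble, reusing lower-order outputs.

    genomes : list of (dx, dy, sx, sy) tuples; uses jet_transform() from the sketch above.
    """
    cache = {(0, 0, 0, 0): A}
    outputs = {}
    # Work from the lowest-order features up to the highest-order ones.
    for g in sorted(genomes, key=sum):
        # Start from the largest already-computed feature that this one can build on.
        base = max((c for c in cache if all(ci <= gi for ci, gi in zip(c, g))), key=sum)
        dx, dy, sx, sy = (gi - ci for gi, ci in zip(g, base))
        outputs[g] = cache[g] = jet_transform(cache[base], dx, dy, sx, sy)
    return outputs
```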
2.4.3 Advantages in Hardware
Jet features were developed to make calculations in hardware simpler for our new algorithm than the original ECO Features algorithm. The original ECO Features algorithm has several
attributes that make it difficult to implement in hardware. Similar to the advantages discussed in
Section 2.4.2, the jet features are even more advantageous in a hardware implementation.
First, the original algorithm forms features from a generic pool of image transforms. This
is relatively straightforward to implement in software when a computer vision library is available,
only requiring extra room in memory for the library calls. In hardware physical space in silicon
must be dedicated to units to perform each of these transforms. The jet feature transform utilizes a
set of simple operations that are reused in every single jet feature.
Second, the transforms of the original algorithm are not commutative. The order in which they are executed affects the output. Intermediate calculations would need to be routable
from every transform to every other transform. This kind of complexity could be tackled with a
central memory, a bus system, redundant transform modules, and/or a scheduler. The jet transform
is commutative and the order of convolutions does not matter. Routing intermediate calculations
becomes trivial.
Third, intermediate calculations from the original ECO Feature transformations can rarely
be used in any other ECO Feature. On the other hand, jet features are cumulative. Using this
property, the ECO Jet Features algorithm is easily pipelined and multiple features
can be calculated simultaneously. In fact, instead of scheduling the order in which features are
calculated, our architecture calculates every possible feature every time an input image is received.
This allows for easy reprogrammability for different applications. The feature outputs required for
that specific model are used and the others are ignored. Little extra hardware is required, and there
is no need for a dynamic control unit.
Fourth, calculating jet features in hardware requires only addition and subtraction operators
in conjunction with pixel buffers. The transforms of the original ECO Features algorithm require
multiplication, division, procedural algorithm control, logarithm operators, square root operators
and more to implement all of the transforms available to the algorithm. In hardware, these operations can require large spaces of silicon and can generate bottlenecks in the pipeline. As mentioned
in Section 2.4.2, the Gaussian blur does require a division by two when normalizing. However,
with a fixed base-two number system, this does not require any extra hardware; it is merely a shift of the implied binary point one place to the left.
2.5 Hardware Architecture
The ECO Jet Features hardware architecture consists of two major parts, a jet feature unit,
and a classifier unit. A simple routing module connects the two, as shown in Figure 2.9. Input
image data is fed into the jet features unit as a series of pixels, one pixel at a time. This type
of serial output is common for image sensors, but we acknowledge that if the ECO Jet Features
algorithm was embedded close to the image sensor, other more efficient types of data transfer
would be possible. As the data is piped through the jet features unit, every possible jet feature
transform is calculated. Only the features that are relevant to the specific loaded model are then
routed to the classifier unit. The classifier unit contains a random forest for every ECO Jet Feature
in the model and the appropriate output from the jet features unit is processed by the corresponding
random forest.
Figure 2.9: System diagram for the ECO Jet architecture. The Jet Features Unit computes every feature for a given multiscale local jet on an input image. A router connects only the ones that were selected during training to a corresponding random forest. The forests each vote on a final prediction.
2.5.1 The Jet Features Unit
The jet features unit calculates every feature for a given multiscale local jet. An input image
is fed into the unit one pixel at a time, in row-major order. As pixels are piped through the unit, it produces
multiple streams of pixels, one stream for every feature in the jet.
All convolutions in jet feature transforms require the addition or subtraction of two pixels.
This is accomplished by feeding the pixels into a buffer, where the incoming pixel is added to or subtracted from the pixel at the end of the buffer, as shown in Figure 2.10. Convolutions in the x
direction (along the rows) require only a single pixel to be buffered due to the fact that the image
sensor transmits pixels in row-major order. Convolutions in the y direction, however, require pixel
buffers to be the width of the input image. A pixel must wait until a whole row of pixels is read in
for its neighboring pixel to be fed into the system.
Figure 2.10: Convolution units. The top unit shows only a single buffer needed for convolution along rows in the x direction. The bottom unit shows a large buffer used for convolution along the columns in the y direction.
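The following Python model sketches the behavior of these two buffer arrangements (an illustration of the idea, not the HDL used in our implementation; the class and parameter names are ours):

```python
from collections import deque

class ConvUnit:
    """Streaming two-pixel convolution: output = incoming pixel +/- pixel at the end of a buffer.

    buffer_len = 1 models convolution along rows (x direction);
    buffer_len = image width models convolution along columns (y direction).
    """
    def __init__(self, buffer_len, subtract=False):
        self.buf = deque()
        self.buffer_len = buffer_len
        self.subtract = subtract

    def push(self, pixel):
        if len(self.buf) < self.buffer_len:
            self.buf.append(pixel)
            return None                      # still filling the buffer, no output yet
        oldest = self.buf.popleft()
        self.buf.append(pixel)
        return pixel - oldest if self.subtract else pixel + oldest
```

A ConvUnit(1) pairs each pixel with its left neighbor in the row-major stream, while a ConvUnit(image_width) delays pixels by a full row and pairs them with the neighbor above; choosing subtraction instead of addition gives the partial-derivative building blocks.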
With units for convolution in both the x and y directions, an array of convolutional units is
connected to produce every jet feature for a given multiscale local jet. By restricting the multiscale
local jet, there are fewer possible jet features. In order to see the effect of restricting the maximum
allowed values for δ and σ, we tested various configurations on the BYU Fish Species dataset
(Section 2.6.1). This dataset is further explained in Section 2.6. We restricted both δ and σ to
a maximum of 15, 10, and 5. Each of these configurations was trained and tested. We observed
that the genetic algorithm often selected either 0 or 1 as values for δ and so a configuration where
δ ≤ 1 and σ ≤ 5 was tested as well. Figure 2.11 shows the average test accuracy for each of
these configurations as the model is being trained. From these results, we feel confident that
restricting δ and σ does not hurt the algorithm’s accuracy significantly. It does, however, restrict
the space significantly, which can mean a much more compact hardware design. In our hardware
architecture, we restrict δ ≤ 1 and σ ≤ 5 and only have 144 different possible jet features.
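The count follows directly from the bounds: δx, δy ∈ {0, 1} and σx, σy ∈ {0, ..., 5} give 2 × 2 × 6 × 6 = 144 combinations, which a quick enumeration confirms:

```python
from itertools import product

# delta_x, delta_y in {0, 1}; sigma_x, sigma_y in {0, ..., 5}
jet_features = list(product(range(2), range(2), range(6), range(6)))
print(len(jet_features))  # 144
```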
[Plot: error rate versus number of Jet Features for the max 15, max 10, max 5, and max 5/max 1 configurations]
Figure 2.11: Comparison of multiscale local jets. The maximum allowable variance in Gaussian blur and order of differentiation was limited to 15, 10, and 5. A fourth case was tested where the variance in Gaussian blur was bounded by 5 and the order of differentiation was bounded by 1.
In order to calculate every jet feature using the least amount of hardware, we perform
convolutions in the y direction first. Since these convolutions require whole line buffers, it is best
to minimize the required number of these units. Each of these convolutions is then fed into similar
modules that perform the convolutions in the x-direction, each producing its own 12 jet features
for a total of 144 different features.
2.5.2 The Random Forest Unit
A random forest is a collection of decision trees. Each tree votes individually for a partic-
ular class based on the values of specific incoming pixels. Nodes of these trees act like gates that
pass their input to one of two outputs. Each node is associated with a specific pixel location and
value. If the actual value of that pixel is less than the value associated with the node, the “left”
output is opened. If the actual value is greater, the “right” output is opened.
Every random forest receives the location information for every incoming pixel. Each tree
contains a lookup table for storing comparison values of each pixel and which pixel location is
associated with it. Entries are stored in the order in which they will be read into the tree, in row-
major order. A pointer is used to keep track of which entry will be seen next by the tree. Once the
target pixel arrives, the pointer moves to the next entry. The sorting of these node values is done
by the host computer system and it is assumed they are loaded in the proper order.
Once there is an open path from the root node of a tree to a leaf node, the tree sends
its prediction, stored at the leaf node. Once all trees in a forest have sent their predictions, the
prediction with the most votes is sent to the tabulating unit. Once all forests have finished voting,
their votes are combined with their weights learned from the AdaBoost process. The classification
with the highest score is selected as the final prediction.
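The voting and tabulation logic can be summarized with the following behavioral Python sketch (the forest weights come from the AdaBoost process described above; all names are illustrative and the hardware itself is described in SystemVerilog):

    from collections import Counter

    def forest_vote(tree_predictions):
        # majority vote among the trees of a single forest
        return Counter(tree_predictions).most_common(1)[0][0]

    def tabulate(forest_votes, forest_weights, num_classes):
        # combine each forest's vote using its AdaBoost weight;
        # the class with the highest total score is the final prediction
        scores = [0.0] * num_classes
        for vote, weight in zip(forest_votes, forest_weights):
            scores[vote] += weight
        return max(range(num_classes), key=lambda c: scores[c])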
In order to select an efficient configuration, we experimented with different sizes of random
forests and different numbers of jet features. We varied the number of trees in a forest, the maximum depth of each tree, and the total number of creatures, which corresponds to the number
of forests. Figure 2.12 shows the accuracy of these different configurations compared with the total
count of nodes in their forests. A pattern of diminishing returns is apparent as models grow to be
more than 3,000 nodes. The models that performed best were ones with at least 10 creatures and
a balance between tree count and tree depth. We used a configuration of 10 features with random
forests of 5 trees, each 5 levels deep. This setup requires 3,150 nodes.
[Plot: test accuracy vs. total number of nodes in the random forests.]
Figure 2.12: Comparison of test accuracy to the total number of nodes in the random forests. Various numbers of trees, forests, and tree depths were tested.
2.6 Results
2.6.1 Datasets
ECO Features was designed to solve visual inspection applications. These applications
typically involve fixed camera conditions where the objects being inspected are similar. This in-
cludes manufactured goods that are being inspected for defects or agricultural produce that is being
graded for size or quality. These applications usually are fairly specific and real-world users do not
have extremely large datasets.
We first explored the accuracy of ECO Features and ECO Jet Features on the MNIST and
CIFAR-10 datasets. Both are common datasets used in deep learning with small images with only
10 different classes in each dataset. The MNIST dataset consists of 70,000 28x28 pixel images
(10,000 reserved for testing) and the CIFAR-10 dataset consists of 60,000 32x32 pixel images
(10,000 reserved for testing). The MNIST dataset features handwritten numerical examples and
the CIFAR-10 images each consist of various objects. Examples are shown in Figure 2.13.
Figure 2.13: Examples from the MNIST dataset (top) with handwritten digits. Examples from the CIFAR-10 dataset (bottom). Classes include airplane, bird, car, cat, deer, dog, frog, horse, ship, and truck.
We also tested our algorithms on a dataset that is more typical for visual inspection tasks.
MNIST and CIFAR-10 contain many more images than what is typically available to users solv-
ing specific visual inspection tasks. Visual inspection training sets also include less variation in
object type and camera conditions than in the CIFAR-10 dataset. The MNIST and CIFAR-10
datasets consist of small images, which makes execution time atypically fast for visual inspection
applications. For these reasons we also used the BYU Fish dataset in our experimentation.
The BYU Fish dataset consists of images of fish from eight different species. The images
are 161 pixels wide by 46 pixels tall. We split the dataset to include 778 training images and
254 test images. Images were converted to grayscale before being passed to the algorithm. Each
specimen is oriented in the same way and the camera pose remains constant. This type of dataset
is typical for visual inspection systems where camera conditions are fixed and a relatively small
number of examples are available. Examples are shown in Figure 2.14.
Figure 2.14: Examples from the BYU Fish dataset. Each image is of a different fish species.
2.6.2 Accuracy on MNIST and CIFAR-10
To get a feel for how Jet Features change the capacity of ECO Features to learn, we trained
the ECO Features algorithm and the ECO Jet Features algorithm on the MNIST and CIFAR-10
datasets. These datasets have many images and were specifically designed for deep learning algo-
rithms which can take advantage of such a large training set. We note that the absolute accuracy
of these models did not compare well with the state-of-the-art deep learning, but we use these
larger datasets to fully test the capacity of our ECO Jet Features in comparison to the original ECO
Features algorithm.
Each model was trained with random forests of 15 trees up to 15 levels deep. When testing
on CIFAR-10, each model was trained with 200 creatures. The accuracy as features were added is shown in
Figure 2.15. On MNIST, the models were only trained to 100 creatures, where they seem to
converge, as shown in Figure 2.16. The CIFAR-10 results show that the models converge to similar
accuracy while ECO Jet Features show a slight improvement (about 0.3%) over the original algorithm on
MNIST. From these results, we conclude that Jet Features introduce no noticeable loss in accuracy.
2.6.3 Accuracy on BYU Fish dataset
We also trained on the BYU Fish dataset with the same experimental setup that was used
on the other datasets. The results are plotted in Figure 2.17. While the models do seem to converge to
Figure 2.15: Accuracy comparison between the original ECO Features algorithm and the ECO Jet Features algorithm on CIFAR-10. Once the models seem to converge, there is no evidence of lost accuracy in the ECO Jet Features model.
similar accuracy, results from training using such a small dataset may not be quite as convincing
as those obtained using larger datasets. These results are included for completeness since this dataset was
used in our procedure primarily for testing speed, efficiency, and model size.
2.6.4 Software Speed Comparison
While the primary objective of the new algorithm was to be hardware friendly, it was inter-
esting to explore the speedup gained in software. Each algorithm was implemented on a full-sized
desktop PC running a Skylake i7 Intel processor, using the OpenCV library. OpenCV contains
built-in functionality for most of the transforms from the original ECO Features algorithm. It also
provides vectorization for Jet Feature operations.
Figure 2.16: Accuracy comparison between the original ECO Features algorithm and the ECO Jet Features algorithm on MNIST. Once the models seem to converge, ECO Jet Features seem to have a slight edge in accuracy, about 0.3%.
We attempted to accelerate these algorithms using GPUs but found this was only possible
on images that were larger than 1024x768. Even using images that were this large did not provide
much acceleration. The low computational cost of the algorithm does not justify the overhead of
CPU to GPU data transfer.
A model of 30 features was created for both algorithms. The BYU Fish dataset was used because the
image sizes are more typical of real-world applications. The original algorithm averaged a run
time of 10.95ms and our new ECO Jet Features algorithm averaged an execution time of 2.95ms,
which is a 3.7x speedup.
Figure 2.17: Accuracy comparison between the original ECO Features algorithm and the ECO Jet Features algorithm on the BYU Fish dataset.
Table 2.2: ECO Jet Features architecture total hardware usage on a Kintex-7 325 FPGA using 10 creatures and 5 trees at a depth of 5.
Resource          Number Used    Percent of Available
Total Slices      10,868         4.9%
Total LUTs        34,552         4.9%
LUTs as Logic     31,644         4.4%
LUTs as Memory    2,908          1.0%
Flip Flops        17,132         1.2%
BRAMs             0              0%
DSPs              0              0%
Table 2.3: The hardware usage for individual design units.
Unit                        Slices    Total LUTs    LUTs as Logic    LUTs as Memory    Flip Flops
Jet Features Unit           7,593     23,253        22,377           876               11,411
Random Forests Unit         3,741     11,080        9,080            2,000             5,080
Individual Random Forests   374       1,108         908              200               508
Individual Decision Trees   73        210           171              40                93
Feature Router              49        40            40               0                 520
AdaBoost Tabulation         61        180           148              32                121
2.6.5 Hardware Implementation Results
Our hardware architecture was designed in SystemVerilog. It was synthesized and imple-
mented for a Xilinx Virtex-7 FPGA using the Vivado design suite. From our analysis reported
in Section 2.5, Figures 2.11 and 2.12, we implemented a model with 10 features, 5 trees in each
forest with a depth of 5, a maximum σ of 5, and maximum δ of 1. We used a round number of 100
pixels for the input image width. A model built around the BYU Fish dataset would have required
only 46 pixels in its line buffers, but this length is small due to the oblong nature of fish. We feel a
width of 100 pixels is more representative of general visual inspection tasks.
The total utilization of available resources on the Xilinx Virtex-7 is reported in Table 2.2.
One interesting point is that this architecture requires no Block RAM (BRAM) or Digital Signal
Processing (DSPs) units. BRAMs are dedicated RAM blocks that are built into the FPGA fabric.
DSPs are generally used for more complex arithmetic operations, like general convolution. Our
architecture, however, was compact enough and simple enough to not require either of these re-
sources and instead host all of its logic in the main FPGA fabric. Look Up Tables (LUTs) make
up the majority of the fabric area and are used to store all data and perform logic operations.
To give a quick reference of FPGA utilization for a CNN on a similar Virtex-7 FPGA,
Prost-Boucle et al. [27] reported using 22% to 74.4% of the 52.6 Mb of total BRAM memory for
various sizes of the model. Our model did not require the use of any of these BRAM memory
units. When comparing the number of LUTs used as logic, Prost-Boucle et al.'s smallest model
used 112% more than our model, and their larger, more accurate model used 769% more.
The pixel clock speed can safely reach 200MHz. Since the design was fully pipelined
around the pixel clock, images from the BYU Fish dataset could, in theory, be processed in 37µs.
This is a 78.3x speedup over the software implementation on a full-sized desktop PC. A custom
silicon design could be even faster than this FPGA implementation.
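For reference, the 37 µs figure follows directly from the image dimensions given in Section 2.6.1 and the 200 MHz pixel clock:

161 pixels × 46 pixels = 7,406 pixels, and 7,406 pixels / 200 MHz ≈ 37.0 µs.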
Table 2.3 shows the relative sizes for individual units of the design. Some FPGA logic
slices are shared between units and the sum of the individual unit counts exceeds the totals listed in
2.2. With a setup of 10 creatures, 5 trees per forest with a depth of 5, the Jet Features Unit makes
up about 70% of the total design. However, since this unit is generating every jet in the multiscale
local jet, it does not grow as more features are added to the model. We showed in Figure 2.11 that
using a large local jet does not necessarily improve performance.
The Random Forests Unit makes up less than 35% of the design in all aspects other
than LUTs that are used as memory, which is a subset of total LUTs. However, only 10 features
were used and more could be added to increase accuracy as shown in Figure 2.17. Extrapolating out
from these numbers, if all 144 possible features were added to this design, only 30% of the resources
available to the Virtex-7 would be used and 87.9% of them would be dedicated to the Random Forests
Unit.
These results showed how compact this architecture is. The simple operations and feedfor-
ward paths used in this design could very feasibly be implemented in custom silicon as well.
2.7 Smart Camera System
We built a compact smart camera for automated visual inspection to demonstrate how ECO
Jet Features can be used in an industrial setting. The camera performed real-time image classifica-
tion using ECO Jet Features. It sent signals to sorting mechanisms indicating when to activate
so that objects were sorted according to their classification.
2.7.1 System Overview
The smart camera system consisted of three major parts: an Nvidia TX1 embedded GPU/CPU,
an Arduino Uno microcontroller, and a FLIR Grasshopper high-speed camera, as shown in Figure
2.18.
Our original smart camera design targeted a GPU-powered algorithm, which is why it fea-
tures an Nvidia TX1. The TX1 is a small GPU/CPU system designed for embedded platforms.
Figure 2.18: System diagram of the smart camera system.
ECO Jet Features is efficient enough to not require a GPU. Instead, the ARM Cortex A57 on the
TX1 is enough to run ECO Jet Features image classification in real time. The CPU classifies the
incoming images from the camera and sends the results to the microcontroller.
The Arduino Uno microcontroller was used to trigger sorting mechanisms. It kept a map
of all incoming objects: where they were, which sorting mechanisms needed to be triggered, when
they needed to be triggered, and for how long. In our demonstration, the Arduino was connected
to pneumatic valves that shot air at the objects in order to blow them off a conveyor belt and into
sorting bins. The Arduino was connected to the conveyor belt in order to monitor the speed of the
belt. This allowed the speed of the belt to be adjusted.
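As a rough illustration of the bookkeeping described above, the following sketch shows one way such a trigger map could be organized (the actual firmware runs on the Arduino; the class, function, and parameter names here, including the pulse duration, are hypothetical placeholders and assume objects reach the valves in arrival order):

    from collections import deque

    class SortScheduler:
        def __init__(self):
            self.pending = deque()  # (fire_time, valve_id, pulse_duration)

        def add_object(self, valve_id, now, belt_speed, distance_to_valve, pulse=0.05):
            # the object reaches its valve after travelling distance_to_valve at belt_speed
            travel_time = distance_to_valve / belt_speed
            self.pending.append((now + travel_time, valve_id, pulse))

        def update(self, now, fire_valve):
            # fire any valves whose objects have reached them
            while self.pending and self.pending[0][0] <= now:
                _, valve_id, pulse = self.pending.popleft()
                fire_valve(valve_id, pulse)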
A FLIR Grasshopper high-speed camera was used to capture images of the objects passing
under the camera on a conveyor belt. Dedicated I/O pins on the camera were connected to the
Arduino to trigger the camera in real time. The Arduino tracked the speed of the conveyor belt and
triggered the camera when a new section of the belt entered the camera frame.
2.7.2 Software
We developed a custom software GUI to control, configure and train the smart camera
system. The software automatically calibrated the microcontroller to determine how fast objects
pass under the camera. It could be used to capture a set of training data to train the ECO Features
algorithm by passing example objects under the camera and later labeling them. It could control
the camera’s settings like aperture, focus, and field of view.
2.8 Conclusion
We have presented Jet Features, learned convolutional kernels that are efficient in both
software and hardware implementations. We applied them to the ECO Features algorithm. This
change to the algorithm allows faster software execution and hardware implementation. In soft-
ware, the algorithm gained a 3.7x speedup with no noticeable loss in accuracy. We also presented
a compact hardware architecture for our new algorithm that is fully pipelined and parallel. On an
FPGA, this architecture can process images in 37µs, a 78.3x speedup over the improved software
implementation.
Jet Features are related to the idea of multiscale local jets. Large groups of these transforms
can be calculated in parallel. They incorporate many other common image transforms such as the
Gaussian blur, Sobel edge detector, and Laplacian transform. The simple operators required to
calculate jet features allow them to be easily implemented in hardware in a completely pipelined
and parallel fashion.
With a compact classification architecture for visual inspection, automatic visual inspec-
tion logic can be embedded into image sensors and compact hardware systems. Visual inspection
systems can be made smaller, cheaper, and available to a wider range of visual inspection applica-
tions.
CHAPTER 3. A REVIEW OF BINARIZED NEURAL NETWORKS
In this chapter, we review an existing use of binary values for image classification called
Binarized Neural Networks (BNNs). BNNs are deep neural networks that use binary values for
activations and weights, instead of full precision values. With binary values, BNNs can execute
computations using bitwise operations, which reduces execution time. Model sizes of BNNs are
much smaller than their full precision counterparts. While the accuracy of a BNN model is generally
less than that of a full precision model, BNNs have been closing the accuracy gap and are becoming
more accurate on larger datasets like ImageNet. BNNs are also good candidates for deep learning
implementations on FPGAs and ASICs due to their bitwise efficiency. We give a tutorial of the
general BNN methodology and review various contributions, implementations, and applications of
BNNs.
3.1 Introduction
Deep neural networks (DNNs) are becoming more powerful. However, as DNN models
become larger they require more storage and computational power. Edge devices in IoT systems,
small mobile devices, power-constrained, and resource-constrained platforms all have constraints
that restrict the use of cutting-edge DNNs. Various solutions have been proposed to help solve
this problem. Binarized Neural Networks (BNNs) are one solution that tries to reduce the memory
and computational requirements of DNNs while still offering similar capabilities of full precision
DNN models.
There are various types of networks that use binary values. In this chapter, we focus on
networks based on the BNN methodology first proposed by Courbariaux et al. in [28] where
both weights and activations only use binary values, and these binary values are used during both
inference and backpropagation training. From this original idea, various works have explored
how to improve their accuracy and how to implement them in low-power and resource-constrained
platforms.
Almost all work on BNNs has focused on advantages that are gained during inference time,
rather than training time. Unless otherwise stated, when the advantages of BNNs are mentioned in
this chapter, we will assume these advantages apply mainly to inference. However, we will look at
the advantages of BNNs during training as well. Since BNNs have received substantial attention
from the digital design community, we also focus on various implementations of BNNs on FPGAs.
BNNs build upon previous methods for quantizing and binarizing neural networks, which
are reviewed in Section 3.3. Since terminology throughout the BNN literature may be confusing
or ambiguous, we review important terms in Section 3.2. We outline the basic mechanics of BNNs
in Section 3.4. Section 3.5 details the major contributions to the original BNN methodology.
Techniques for improving accuracy and execution time at inference are covered in Section 3.6. We
present accuracies of various BNN implementations on different datasets in Section 3.7. FPGA
and ASIC implementations are highlighted in Sections 3.8.1 and 3.8.5.
3.2 Terminology
Before diving into the details of BNNs and how they work, we want to clarify some of the
terminology that will be used throughout this review. Some terms are used interchangeably in the
literature and can be ambiguous.
Weights: Learned values that are used in a dot product with activation values from previous
layers. In BNNs, there are real-valued weights that are learned and binary versions of those weights
which are used in the dot product with binary activations.
Activations: The outputs from an activation function that are used in a dot product with the
weights from the next layer. Sometimes the term “input” is used instead of activation. We use the
term “input” to refer to input to the network itself and not just the inputs to an individual layer. In
BNNs, the output of the activation function is a binary value and the activation function is the sign
function.
Dot product: A multiply-accumulate operation that occurs in the “neurons” of a neural network.
The term “multiply-accumulate” is used at times in the literature, but we use the term dot product
instead.
Parameters: All values that are learned by the network through backpropagation. This
includes weights, biases, gains, and other values.
Bias: An additive scalar value that is usually learned. Found in batch normalization layers
and specific BNN techniques that will be discussed later.
Gain: A scaling factor that is usually learned (but sometimes extracted from statistics; see Section 3.5.2). Similar to a bias. A gain is applied after a dot product between weights and activations.
The term scaling factor is used at times in the literature, but we use gain here to emphasize its
correlation with bias.
Topology: The specific arrangement of layers in a network. The term “architecture” is used
frequently in the DNN community. However, the digital design and FPGA community also use the
term architecture to refer to the arrangement of hardware components. We use topology to refer to
the layout of the DNN model.
Architecture: The connection and layout of digital hardware. Not to be confused with the
topology of the DNN models themselves.
Fully Connected Layer: As a clarification, we use the term fully connected layer instead of
dense layer like some of the literature reviewed in this chapter.
3.3 Background
Various methods have been proposed to help make DNNs smaller and faster without sac-
rificing excess accuracy. Howard et al. proposed channel-wise separable convolutions as a way
to reduce the total number of weights in a convolutional layer [29]. Other low rank and weight
sharing methods have been explored [30, 31]. These methods do not reduce the data width of the
network, but instead, use fewer parameters for convolutional layers while maintaining the same
number of channels and kernel size.
SqueezeNet is an example of a network topology designed specifically to reduce the num-
ber of parameters used [32]. SqueezeNet requires fewer parameters by using more 1× 1 kernels
for convolutional layers in place of some 3×3 kernels. They also reduce the number of channels
in the convolutional layers to reduce the number of parameters even further.
Most DNN models are overparameterized and network pruning can help reduce size and
computation [33–35]. Neurons that do not contribute much to the network can be identified and
removed from the network. This leads to sparse matrices and potentially smaller networks with
fewer calculations.
3.3.1 Network Quantization Techniques
Rather than reducing the total number of parameters and activations to be processed in a
DNN, quantization reduces the bit width of the values used. Traditionally, 32-bit floating-point
values have been used in deep learning. Quantization techniques use data types that are smaller
than 32-bits and tend to focus on fixed point calculations rather than floating-point. Using smaller
data types can offer a reduction in total model size. In theory, arithmetic with smaller data types
can be quicker to compute and fixed-point operations can be more efficient than floating-point
ones. Gupta et al. show that reducing datatype precision in a DNN offers reduced model size with
limited reduction in accuracy [36].
We note, however, that 32-bit floating-point arithmetic operations have been highly op-
timized in GPUs and most CPUs. Performing fixed-point operations on hardware with highly
optimized floating-point units may not achieve the kinds of execution speed advantages that over-
simplified speedup calculations might suggest.
Courbariaux et al. compare accuracies of trained DNNs using various sizes of fixed and
floating-point values for weights and activations [37]. They even examine the effect of a hybrid
dynamic fixed-point data type and show how comparable accuracy can be obtained with sub-32-bit
precision.
Using quantized values for gradients has also been explored in an effort to reduce training
time. Zhou et al. experiment with several low bit widths for gradients [38]. They test various com-
binations of low bit-widths for activations, gradients, and weights. They observe that using higher
precision is more useful in gradients than in activations, and using higher precision in activations
is more useful than in weights.
3.3.2 Early Binarization
The most extreme form of network quantization is binarization. Binarization is a 1-bit
quantization where data can only have two possible values. Generally, −1 and +1 have been used
for these two values (or −γ and +γ when scaling is considered, see Section 3.6.1). Notice that
quantized networks that use the values −1, 0, and +1 are not binary but ternary, a point of confusion in
some of the literature [39–42]. They exhibit a high level of compression and simple arithmetic but
do not benefit from the single-bit simplicity of BNNs since they require 2 bits of precision.
The idea of using binary weights predates the current boom in deep learning [43]. Early
networks with binary values contained only a single hidden layer [43,44]. These early works point
out that backpropagation (BP) and stochastic gradient descent (SGD) cannot be directly applied
to these networks since weights cannot be updated in small increments. As an alternative, early
works with binary values used variations of Bayesian inference. More recently [45] applies a
similar method, Expectation Backpropagation, to train deep networks with binary values.
Courbariaux et al. claim to be the first to train a DNN from start to finish using binary
weights and BP with their BinaryConnect method [46]. They use real-valued weights which are
binarized before being used by the network. During backpropagation, the gradient is applied to
the real-valued weights using the Straight-Through Estimator (STE) which is explained in Section
3.4.1.
While binary values are used for the weights, Courbariaux et al. retain full precision ac-
tivations in BinaryConnect. This eliminates the need for full precision multiplications but still
requires full precision accumulations. BinaryConnect is named in reference to DropConnect [47],
but connections are binarized instead of being dropped.
These early works in binary neural networks are certainly binary in a general sense. How-
ever, this chapter defines BNNs as networks that use binary values for both weights and activations
allowing for bitwise operations instead of multiply-accumulate operations. Soudry et al. was one
of the first research groups to focus on DNNs with binary weights and activations [45]. They use
Bayesian learning to get around the problems of learning with binary values [45]. However, Cour-
bariaux et al. are able to use binary weights and activations during training with backpropagation
techniques and take advantage of bitwise operations [28, 48]. Their BNN method is the basis for
most binary networks that have come since (with some notable exceptions in [49, 50]).
3.4 An Introduction to BNNs
Courbariaux et al. [28, 48] develop the BNN methodology that is used by most network
binarization techniques. In this section, we will review the functionality of this original BNN
methodology. Other specific details from [28, 48] will be reviewed in Section 3.5.1.
In BNNs, both the weights and activations are binarized. This reduces the memory require-
ment for BNNs and the computational complexity through the use of bitwise operations.
3.4.1 Binarization of Weights
Courbariaux et al. first provide a way to train using binary weights in [46] using back-
propagation with a gradient-descent based method (SGD, Adam, etc.). Using binary values during
training provides a more representative loss to train against instead of only binarizing a network
once training is complete. Computing the gradient of the loss w.r.t binary weights through back-
propagation is not a problem. However, updates to the weights using gradient descent methods
(SGD, Adam, etc.) prove impossible with binary weights. Gradient descent methods make small
changes to the value of the weights, which cannot be done with binary values.
In order to solve this problem, Courbariaux et al. keep a set of real-valued weights, WR,
which are binarized within the network to obtain binary weights, WB. WR can then be updated
through backpropagation and the incremental updates of gradient descent. During inference, WR is not
needed and the binary weights are the only weights that are stored and used. Binarization is done
using a simple sign function
WB = sign(WR) (3.1)
resulting in a tensor with values of +1 and −1.
Calculating the gradient of the loss directly w.r.t. the real-valued weights is meaningless
due to the sign function used in binarization. The gradient of the sign function is 0 or undefined
at every point. To get around this problem, Courbariaux et al. use a heuristic called the Straight
Through Estimator (STE) [51]. This method approximates a gradient by bypassing the gradient of
the layer in question. The problematic gradient is simply turned into an identity function
∂L/∂WR = ∂L/∂WB    (3.2)
where L is the loss at the output. This gradient approximation is used to update the real-valued weights.
This binarization is sometimes thought of as a layer unto itself. The weights are passed
through a binarization layer that evaluates the sign of the values in the forward pass and performs
an identity function during the backward pass, as illustrated in Figure 3.1.
Figure 3.1: A visualization of the sign layer and Straight-Through Estimator (STE). While the real values of the weights are processed by the sign function in the forward pass, the gradient of the binary weights is simply passed through to the real-valued weights.
Using the STE, the real-valued weights can be updated with an optimization strategy, like
SGD or Adam. Since the gradient updates can affect the real-valued weights WR without changing
the binary values WB, if values in WR are not bounded, they can accumulate to very large numbers.
For example, if during a large portion of training a positive value of WR is evaluated to have a
positive gradient, every update will increase that value. This can create large values in WR. BNNs
clip the values of WR between −1 and +1. This keeps the values of WR closer to WB.
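The weight handling described above can be summarized in a short NumPy sketch (plain SGD is shown for illustration; the original work uses optimizers such as SGD or Adam, and the names here are not from the original code):

    import numpy as np

    def binarize_weights(w_real):
        # forward binarization of Eq. (3.1); zero is mapped to +1 here
        return np.where(w_real >= 0, 1.0, -1.0)

    def update_real_weights(w_real, grad_wrt_wb, lr=0.01):
        # STE: the gradient w.r.t. the binary weights (Eq. 3.2) is applied
        # directly to the real-valued weights, which are then clipped to [-1, +1]
        w_real = w_real - lr * grad_wrt_wb
        return np.clip(w_real, -1.0, 1.0)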
3.4.2 Binarization of Activations
Binarization of the activation values was introduced in the first BNN paper by Courbariaux
et al. [28]. In order to binarize the activations, they are passed through a sign function using an
STE in the backward pass, similar to how the weights are binarized. This sign function serves as
the activation function in the network. In order to obtain good results, Courbariaux et al. find that
they need to cancel out the gradient in the backward pass if the input to the activation was too large,
using
∂L/∂aR = ∂L/∂aB · 1|aR|≤1    (3.3)
where aR is the real-valued input to the activation function and aB is the binarized output of the
activation function. 1|aR|≤1 is the indicator function that evaluates to 1 if |aR| ≤ 1 and 0 otherwise.
This zeros out the gradient if the input to the activation function is too large. This functionality
can be achieved by adding a hard tanh function before the sign activation function, but this layer
only has an effect in the backward pass and no effect in the forward pass.
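A minimal PyTorch-style sketch of this sign activation with the gradient cancellation of Eq. (3.3) is shown below (this is a sketch of the described behavior, not the authors' implementation):

    import torch

    class SignActivation(torch.autograd.Function):
        @staticmethod
        def forward(ctx, a_real):
            ctx.save_for_backward(a_real)
            # sign activation; zero is mapped to +1 here
            return torch.where(a_real >= 0,
                               torch.ones_like(a_real),
                               -torch.ones_like(a_real))

        @staticmethod
        def backward(ctx, grad_output):
            a_real, = ctx.saved_tensors
            # STE with cancellation: pass the gradient only where |a_real| <= 1
            return grad_output * (a_real.abs() <= 1).to(grad_output.dtype)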
3.4.3 Bitwise Operations
When using binary values, the dot product between weights and activations can be reduced
to bitwise operations. The binary values can either be −1 or +1. These signed binary values are
encoded with a 0 for −1 and a 1 for +1. To be clear, we refer to the signed values −1 and +1 as
binary “values” and their encodings, 0 and 1, as binary “encodings”.
Using an XNOR logical operation on the binary encodings is equivalent to performing
multiplication on the binary values, as seen in Table 3.1.
Table 3.1: This table shows how the XNOR operation on the encodings is equivalent to multiplication of the binary values, shown in parentheses.
Encoding (Value)    Encoding (Value)    XNOR (Multiply)
0 (−1)              0 (−1)              1 (+1)
0 (−1)              1 (+1)              0 (−1)
1 (+1)              0 (−1)              0 (−1)
1 (+1)              1 (+1)              1 (+1)
A dot product requires an accumulation of all the products between values. XNOR can per-
form the multiplication on a bitwise level, but performing the accumulation requires a summation
of the results of the XNOR operation. Using the binary encodings resulting from the XNOR oper-
ation, this can be done by counting the number of 1 bits in a group of XNOR products, multiplying
this value by 2, and subtracting the total number of bits, producing a signed integer value. Processor
instruction sets often include a popcount instruction to count the number of ones in a binary value.
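A minimal sketch of this XNOR/popcount dot product on bit-packed vectors (the function and variable names are illustrative):

    def binary_dot(a_bits, w_bits, n):
        # a_bits and w_bits each pack n values of +1/-1, encoding +1 as 1 and -1 as 0
        mask = (1 << n) - 1
        xnor = ~(a_bits ^ w_bits) & mask      # bitwise multiply of the values
        popcount = bin(xnor).count("1")       # number of +1 products
        return 2 * popcount - n               # signed integer dot product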
These bitwise operations are much simpler to compute than multi-bit floating-point or
fixed-point multiplication and accumulation. This can lead to faster execution times and/or fewer
hardware resources required. However, theorizing efficiency speedups is not always straightfor-
ward.
For example, when looking at the execution time of a CPU, some papers that we reviewed
here use the number of instructions as a measure of execution time. The 64-bit x86 instruction set
allows a CPU to perform a bitwise XNOR operation between two 64-bit registers. This operation
takes a single instruction. With a similar 64-bit CPU architecture, two 32-bit floating-point multiplications
could be performed per instruction. One could conclude that the bitwise operations would have a 32×
speed up over the floating-point operations. However, the number of instructions is not a measure
of execution time. Each instruction can take a variable number of clock cycles to execute. Instruc-
tion and resource scheduling within a modern CPU core is dynamic and the number of cycles to
produce an instruction’s result depends on other instructions that have come before. CPUs and
GPUs are optimized for different types of instruction profiles. Instead of using the number of in-
structions as a measure of efficiency, it is better to look at the actual execution times. Courbariaux
et al. [28] observe a 23× speedup when optimizing their code for bitwise operations.
Not only do bitwise operations offer faster execution times in software-based implementa-
tions, but BNNs also require fewer hardware resources in digital designs.
3.4.4 Batch Normalization
Batch normalization (BN) layers are common practice in deep learning. They condition the
values within a network for faster training and act as a form of regularization. In BNNs, they are
considered essential. BN layers not only condition the values used during training, but they contain
gain and bias terms that are learned by the network. These learned terms help add complexity to
BNNs, which can suffer without them. The efficiency of BN layers is discussed in Section 3.6.7.
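One property worth noting (and often exploited by hardware implementations) is that, at inference, a BN layer followed directly by a sign activation can be collapsed into a simple threshold comparison. A minimal sketch, assuming per-channel parameters and ignoring the γ = 0 and boundary-tie cases:

    import numpy as np

    def bn_then_sign(x, gamma, beta, mean, var, eps=1e-5):
        # reference behavior: batch norm followed by the sign activation
        y = gamma * (x - mean) / np.sqrt(var + eps) + beta
        return np.where(y >= 0, 1.0, -1.0)

    def threshold_form(x, gamma, beta, mean, var, eps=1e-5):
        # equivalent comparison against a precomputed threshold;
        # the comparison direction flips when gamma is negative
        thresh = mean - beta * np.sqrt(var + eps) / gamma
        return np.where((x >= thresh) == (gamma > 0), 1.0, -1.0)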
3.4.5 Accuracy
While BNNs are compact and efficient compared to their full precision counterparts, they
suffer degradation in accuracy. The original BNN proposed in [28] suffers a 3% loss in accuracy
on the CIFAR-10 dataset and did not show comparable results on the larger ImageNet dataset.
However, with improvement from other authors and modifications to the BNN methodology, more
recent BNNs have achieved comparable results on the ImageNet data set, showing a decrease in
accuracy of 3.1% on top-5 accuracy and 6.0% on the top-1 accuracy [52].
3.4.6 Robustness to Attacks
Full precision DNNs have been shown to be susceptible to adversarial attacks [53, 54].
Small perturbations to an input can cause gross misclassification in a classifier network. This is espe-
cially true of convolutional networks where input images can be altered in ways that are impercep-
tible to humans but cause extreme failure in the network.
BNNs, however, have shown robustness to this problem [55,56]. Small changes in the input
image have less of an effect on the network activations since discrete values are used. Adversarial
perturbations are generally computed using gradient methods, which, as discussed above, are not
directly computable in BNNs.
3.5 Major BNN Developments
While there has been much work done on BNNs and how to improve their accuracy, a
handful of works have put forth key ideas that significantly expound upon the original methodology
of BNNs [28]. Before discussing and comparing the literature of BNNs as a whole, we wish to step
through each of these selected works and look at the contributions of each. These works are
highly referenced in the BNN literature, directly compared against in much of the BNN literature,
and/or made significant changes to the BNN methodology. We feel it is useful to examine them
individually. We recognize that this selection of works is somewhat subjective and works not
included in this section have made contributions as well. After reviewing each of these works in
isolation, we will examine the BNN literature as a whole, summarizing our observations by topic
rather than by publication.
3.5.1 The Original BNN
We already summarized the basics of the originally proposed methodology for BNNs in
Section 3.4. Here we will review details specific to [28,48] that were not mentioned earlier. Cour-
bariaux et al. reported their method and results, including details about their experiments on
the MNIST, SVHN, and CIFAR-10 datasets, in [28]. In their other paper [48] they did not
include all of the details of these three experiments but did include their preliminary results on the
ImageNet dataset.
While most of the binarization done with BNNs is deterministic using the simple sign func-
tion, Courbariaux et al. discuss stochastic binarization, which can lead to better results than their
BNN model [28] and their earlier BinaryConnect Model [46]. Stochastic binarization binarizes
values using a probability distribution that favors binarizing to −1 when the value is closer to −1
and binarizing to +1 when the value is closer to +1. This helps regularize the training of the
network and produces better test results. However, generating probabilities for every
binarization requires more complex computation than deterministic binarization. Deter-
ministic binarization is used in almost all of the literature on BNNs.
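One common formulation of stochastic binarization uses a hard-sigmoid probability, as in BinaryConnect [46]; the sketch below is illustrative rather than the authors' exact code:

    import numpy as np

    def stochastic_binarize(w, rng=np.random.default_rng()):
        # probability of binarizing to +1 grows linearly from 0 at w = -1 to 1 at w = +1
        p = np.clip((w + 1.0) / 2.0, 0.0, 1.0)
        return np.where(rng.random(w.shape) < p, 1.0, -1.0)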
Aside from the methodology presented in Section 3.4, Courbariaux et al. give details for
optimizing the learning process of BNNs. They point out that training a BNN takes longer than
traditional DNNs due to the STE heuristic needed to approximate the gradient of the real-valued
weights. To speed this process, they make modifications to the BN layer and the optimizer. For
both of these, they use a shift-based method, shifting all of the bits to the left or right, which is
an efficient way of multiplying or dividing a value by two. While this can speed up training time,
the majority of publications on BNNs do not focus on optimization during training time in favor
of test accuracy and speed at inference.
The specific topologies used for the MNIST, SVHN, and CIFAR-10 datasets are reused by
many of the papers that follow [28,48]. Instead of processing the MNIST dataset with convolutions
they used 3 fully connected (FC) layers with 4096 nodes in the hidden layers. For the SVHN and
CIFAR-10 datasets, they use a VGG-like topology with two 128-channel convolutional layers, two
256-channel convolutional layers, two 512-channel convolutional layers, and three fully-connected
layers, the first two with 1024 nodes each. This topology has been replicated by many works based
on BNNs. BNN topology is discussed in detail in Section 3.7.2.
While [28] does not include results on experiments using ImageNet, [48] does provide
some details on the earliest BNN results for ImageNet. AlexNet and GoogleNet were both
used, replacing their FC and convolutional layers with binary versions. While these models did not
perform very well during testing, they are a starting place that other works have built on.
Courbariaux et al. point out that the bitwise operations of BNNs are not efficiently run
on standard deep learning frameworks without additional coaxing. They build and provide a
custom GPU kernel that runs efficient bitwise operations. They demonstrate the benefits of their
technique showing a 7× speed up on the MNIST dataset.
A follow-on paper [57] provides responses to the next works reviewed below, applications
for LSTMs, and a generalization to other levels of quantization.
3.5.2 XNOR-Net
Soon after the original work on BNNs [48], Rastegari et al. proposed a similar model called
XNOR-Net [58]. XNOR-Net was designed to perform well on the ImageNet dataset. XNOR-Net
includes all of the major methods from the original BNN but adds a gain term to compensate for
lost information during binarization. This gain term is extracted from the statistics of the weights
and activations before binarization.
A pair of gain terms is computed for every dot product in the convolutional layer. The
L1-norm of both the weights and activations are multiplied together to obtain this gain term. This
gain term does improve the performance of BNNs, as shown by their results; however, their results
may be a bit misleading. Their results could not be reproduced in [38] and do not match the
results reported by Courbariaux et al. [48] or others [38, 59].
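A simplified sketch of how these statistics-based gains are formed (XNOR-Net additionally averages the activation magnitudes over each kernel window with an extra convolution, which is omitted here; names and shapes are illustrative):

    import numpy as np

    def weight_gains(W):
        # W: (out_channels, in_channels, k, k); one gain per kernel,
        # the mean absolute value (L1 norm divided by element count)
        return np.abs(W).reshape(W.shape[0], -1).mean(axis=1)

    def activation_magnitudes(A):
        # A: (in_channels, H, W); per-pixel mean absolute value across channels
        return np.abs(A).mean(axis=0)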
The gain term introduced by XNOR-Net seems to improve its performance, but it does
come at a cost. Rastegari et al. report a theoretical speedup of 64× over traditional DNNs. This
comes from a simple calculation that 1-bit operations should be 64× faster than 64-bit floating-
point operations. Not only is this inaccurate, as described in Section 3.4.3, but it also does not take into
consideration the computations required to calculate the gain term. XNOR-Net must calculate L1-
norms for all convolutions during training and inference. The rest of the works presented in this
section make mention of this. While a gain term helps improve the accuracy of BNNs, the manner
in which XNOR-Net computes gain terms is costly.
Rastegari et al. point out that by placing the pooling layer after the dot product layer (FC
or convolutional layer) rather than after the activation layer, training is improved. This allows the
max pool layer to look at the signed integer values out of the dot product instead of the binary
values out of the activation. A max pool of binary values would have no information about the
magnitude of the inputs to the activation, thus the gradient is passed to all activations with a +1
value rather than the largest one before binarization.
3.5.3 DoReFa-Net
Zhou et al. try to generalize quantization and take advantage of bitwise operations for fixed-
point data with widths of various sizes [38]. They introduce DoReFa-Net, a model with a variable
width size for weights, activations, and even gradient calculations during backpropagation. Zhou
et al. emphasized speeding up training time instead of just inference.
DoReFa-Net claims to utilize bitwise operations by breaking down dot products of fixed-
point values into multiple dot products of binary values. However, the complexity of their bitwise
operations is O(n²), where n is the bit width of the data used, which is not better than fixed-point dot
products.
Zhou et al. point out the inefficiencies of XNOR-Net’s method for calculating a gain term.
DoReFa-Net does not dynamically calculate gain terms using the L1-norm of both activations and
weights. Instead, gain terms are only based on the weights of the network. This allows for efficient
inference since the weights and gain terms do not change after training.
The topology of DoReFa-Net is used throughout the BNN literature and is explained in
Section 3.8.2.
3.5.4 Tang et al.
Tang et al. [60] present multiple ideas for BNNs that are used by others. They do not
present a new topology but binarize AlexNet and focus on classification accuracy on the ImageNet
dataset.
To speed up training, Tang et al. study how the learning rate affects the rate of improvement
in the network and how it affects the rate at which binary values oscillate between −1 and +1. For
a given learning rate, the sign of the weights in a BNN oscillates much more frequently than in a
full-precision network. The number of sign changes in a BNN is orders of magnitude more than in
a traditional network. Tang et al. show better training in BNNs when small learning rates are used.
Tang et al. take advantage of a gain term in their network and point out the inefficient
manner in which XNOR-Net uses a gain term. They propose to use a learned scaling factor by
using a PReLU layer in their network. As opposed to the ReLU layer which zeros out negative
inputs, PReLU learns a positive gain term to apply to the negative input values. This gain is applied
indirectly within the PReLU.
Tang et al. notice the bulky nature of the FC layers used in previous BNN implementations.
In a convolutional layer, many small dot products are calculated, while an FC layer performs a
single dot product that is much larger than those used in convolutional layers. In BNNs, whole values (−1 and +1) are used
instead of the fractional values seen in traditional DNNs. Tang et al. point out that this can lead to
a large range of possible values for the final FC layer, which does not play nicely with the softmax
function used in classification.
Previous works get around this by leaving the final layer at full precision instead of binariz-
ing it. Tang et al. binarize the last layer and introduce a learned gain term at every neuron. With a
binarized last layer, the BNN becomes much more compact since most of the model’s parameters
lie in the FC layers.
To help generalization, Tang et al. emphasize the importance of a regularizer. This is the
first instance of a regularizer used during the training of a BNN that we could find. They also use
multiple bases for binarization, which are discussed in Section 3.6.2.
3.5.5 ABC-Net
The ABC-Net model is introduced in [52] by Lin et al. ABC-Net tries to close the
accuracy gap between BNNs and full precision networks. ABC-Net generalizes some of the multi-
bit ideas in DoReFa-Net and the gain terms learned by the network in [60].
ABC-Net binarizes activations and weights into multiple bases. For weights, the binariza-
tion function is given by
BWi = sign(W̄ + ui std(W))    (3.4)
where W is the set of weights being binarized, W̄ = W − mean(W), std(W) is the standard deviation
of W, and ui is a learned term. A set of BWi binarizations is produced according to the learned
parameters ui. This binarization function centers the weights around zero and produces different
binarizations according to different threshold biases (ui std(W)).
These binarized linear bases can be used in bitwise dot products with the activations. The
results are then combined in a linear combination with learned gain terms. This technique is
reminiscent of the multi-bit method proposed in DoReFa-Net, but instead of using slices from
the powers of two, the bases are based on learned biases that act as thresholds. This requires more
calculations, but offers better accuracy than DoReFa-Net for the number of bitwise operations
used. It also uses learned gain terms in the linear combination of the bases which is a more general
use of a gain term than just in a PReLU layer like Tang et al. [60].
The binarization of the weights is aided by calculating the mean and standard deviation of
the weights. Once the weights are learned, there is no need to calculate the mean and standard
deviation again during inference. If the same method were used on the activations, the network
would suffer from a similar problem as XNOR-Net where these values would need to be calculated
again during inference. Instead, ABC-Net makes multiple binarized bases for the activations using
a learned threshold bias without the aid of the mean or standard deviation with
BAi = sign(A+ui) (3.5)
where BAi is the binarized base for the set of activations A and ui is a learned threshold bias. A gain
term is learned which is associated with each activation base.
Each binary activation base can be combined with each binary weight base in a dot product.
ABC-Net takes advantage of efficient gain terms and multiple biases, but the computation cost is
higher for each base that is added.
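A small sketch of the multi-base scheme described by Eqs. (3.4) and (3.5), where the learned gains form the final linear combination (this is an illustrative reading of the method, not the authors' code; A and W are treated as flattened 1-D vectors):

    import numpy as np

    def weight_bases(W, u):
        W_bar = W - W.mean()
        return [np.where(W_bar + ui * W.std() >= 0, 1.0, -1.0) for ui in u]   # Eq. (3.4)

    def activation_bases(A, v):
        return [np.where(A + vi >= 0, 1.0, -1.0) for vi in v]                  # Eq. (3.5)

    def abc_dot(A, W, u, v, alpha, beta):
        # linear combination of all pairwise binary dot products,
        # weighted by the learned gains alpha (weights) and beta (activations)
        total = 0.0
        for a_base, b in zip(activation_bases(A, v), beta):
            for w_base, al in zip(weight_bases(W, u), alpha):
                total += b * al * np.dot(a_base, w_base)
        return total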
The ABC-Net method is applied to various sizes of ResNet topologies and shows only a
3.3% drop in top-5 accuracy on ImageNet compared to full 32-bit precision when using 5 bases
for both activations and weights.
3.5.6 BNN+
Darabi et al. extend some of the core principles of the original BNN by looking at al-
ternatives to the STE used by all previous BNNs. The STE simply uses an identity function for
the backpropagation of the gradient through the sign activation layer. Combining this with the
need to kill gradients of large activations (see Section 3.4.2), the backpropagation of gradients
through sign activation function can be viewed as an impulse function which clips the gradient if
the activation has an absolute value greater than 1.
The BNN+ methodology improves learning through a different effective backpropagation
function in place of the impulse function. Instead of the impulse function, a variation of the
derivative of a Swish-like activation function is used:
dSSβ(x)/dx = β(2 − βx tanh(βx/2)) / (1 + cosh(βx))    (3.6)
where β can be a hyperparameter set by the user or a learned parameter set by the network. Darabi
et al. state that this type of function allows for better training since it is differentiable instead of
piecewise at −1 and +1.
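Implemented directly from Eq. (3.6), the backward-pass estimator looks like the following (the default β shown is an arbitrary placeholder, not a value from the original work):

    import numpy as np

    def swish_sign_grad(x, beta=5.0):
        # replaces the STE's clipped identity in the backward pass
        return beta * (2.0 - beta * x * np.tanh(beta * x / 2.0)) / (1.0 + np.cosh(beta * x))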
Another focus of the BNN+ methodology is a regularization function that helps force the
weights towards −1 and +1. They compare two approaches, one that is based on the L-1 norm
R1(w) = |α−|w|| (3.7)
and another that is based on the L-2 norm.
R2(w) = (α − |w|)²    (3.8)
When α = 1 this regularizer is minimized when weights are closer to −1 and +1. The network is
allowed to learn this parameter.
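Both regularizers are simple elementwise penalties summed over the weights, as in this sketch of Eqs. (3.7) and (3.8):

    import numpy as np

    def r1_reg(w, alpha=1.0):
        return np.sum(np.abs(alpha - np.abs(w)))      # Eq. (3.7)

    def r2_reg(w, alpha=1.0):
        return np.sum((alpha - np.abs(w)) ** 2)       # Eq. (3.8)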
In addition to these new techniques, BNN+ uses a gain term. It is notable that multiple
bases are not used. Compared to other single base techniques, BNN+ reports the best accuracies
on ImageNet, but does not perform quite as well as ABC-Net.
3.5.7 Comparison
Here we compare the methods reviewed in this section. Table 3.2 summarizes the accura-
cies of these methods on the CIFAR-10 dataset. Table 3.3 compares accuracies on the ImageNet
dataset. See Section 3.7 for further accuracy comparisons of BNNs. Table 3.4 compares the features
of each method. The results are listed in each table in chronological order of when they were
published.
It is interesting to note that the results reported by Courbariaux et al. [28] on the CIFAR-10
dataset for the original BNN method achieve the best performance. Most of the work since [28]
has focused on improving results on the ImageNet dataset.
Table 3.2: Comparison of accuracies on the CIFAR-10 dataset from works presented in this section.
Methodology     Topology      Accuracy (%)
Original BNN    BNN           89.85
XNOR-Net        BNN           89.83
BNN+            AlexNet       87.16
BNN+            DoReFa-Net    83.92
3.6 Improving BNNs
Several techniques for improving the accuracy of BNNs were reviewed throughout the last
section. We will now cover each technique individually.
3.6.1 Scaling with a Gain Term
Binarization only allows information about the sign of inputs to be passed to the next layers
in the network while information about the magnitude of the input is lost. The resulting values are
either −1 or +1. This allows for efficient computation using bitwise dot product operations at a
cost of lost information in the network. Gain terms can be used after the bitwise dot products have
occurred to give the output a sense of magnitude. Many works on BNNs point out that this allows
for a binarization with values of −γ and +γ, where γ is the gain term. This lends the network more
Table 3.3: Comparison of accuracies on the ImageNet dataset from works presented in this section. Full precision network accuracies are included for comparison as well.
Methodology       Topology     Top-1 Accuracy (%)    Top-5 Accuracy (%)
Original BNN      AlexNet      41.8                  67.1
Original BNN      GoogleNet    47.1                  69.1
XNOR-Net          AlexNet      44.2                  69.2
XNOR-Net          ResNet18     51.2                  73.2
DoReFa-Net        AlexNet      43.6                  -
Tang et al.       —            51.4                  75.6
ABC-Net           ResNet18     65.0                  85.9
ABC-Net           ResNet34     68.4                  88.2
ABC-Net           ResNet50     76.1                  92.8
BNN+              AlexNet      46.11                 75.70
BNN+              ResNet18     52.64                 72.98
Full Precision    AlexNet      57.1                  80.2
Full Precision    GoogleNet    71.3                  90.0
Full Precision    ResNet18     69.3                  89.2
Full Precision    ResNet34     73.3                  91.3
Full Precision    ResNet50     76.1                  92.8
Table 3.4: A table of major details of the methods presented in this section. Activation refers to which kind of activation function is used. Gain describes how gain terms were added to the network. Multiplicity refers to how many binary convolutions were performed in parallel in place of full precision convolution layers. The regularization column indicates which kind of regularization was used, if any.
Methodology     Activation            Gain              Multiplicity    Regularization
Original BNN    Sign Function         None              1               None
XNOR-Net        Sign Function         Statistical       1               None
DoReFa-Net      Sign Function         Learned Param.    1               None
Tang et al.     PReLU                 Inside PReLU      2               L2
ABC-Net         Sign w/ Thresh.       Learned Param.    5               None
BNN+            Sign w/ SS for STE    Learned Param.    1               L1 and L2
complexity if used correctly. If −γ and +γ are fed directly into another sign activation function
centered at 0, the gain term would have no effect since sign(±γ) = sign(±1). BNNs with
BN layers can avoid this pitfall since a bias term is built in. See Section 3.6.7 for more details on
the combination of BN and the sign activation function.
Gain terms can be used to give more capacity to a network when multiple gain terms are
used within a dot product or to form a linear combination of parallel dot products. Instead of simply
changing the values used in the binarization from +1 and −1 to +γ and −γ , different weights can
be added to elements within the binary dot product to make it act more like a full precision dot
product. This is what is done with XNOR-Net [58]. XNOR-Net uses magnitude information to
form a gain term for both the weights and activations. Every “pixel” in a tensor of feature maps
has its own gain term based on the magnitude of all the channels at that “pixel”. Every “kernel”
also has its own gain term. However, as detailed in Section 3.5.2 this is not very efficient. A full
precision convolution is required since the gain of every “pixel” acts as a full precision weight.
Instead of using gain terms within the dot products like XNOR-Net, other works use gains
to perform a linear combination between parallel dot products. Some networks use gain terms that
are extracted from the statistics of the inputs [38, 60], and others learn these gain terms [52, 61].
The idea of parallel binary dot products that are combined in a linear combination is often referred
to as multiple bases, which is covered in the next section.
3.6.2 Using Multiple Bases
Multiple binarizations of a single set of inputs have been used to help with the loss of
information during binarization in BNNs. These multiple binarizations can be seen as bases that
can be combined to form a result with more information. Efficient dot products can still be used in
computing outputs, but extra computations are needed to combine the multiple bases together.
DoReFa-Net [38] breaks inputs into bases where each base corresponds to a power of two.
There is one binary base for each power of two needed to represent the data being binarized. The
number of bases needs to match the number of fixed-point bits of precision in the input. DoReFa-
Net uses fewer bits of precision in the data when less precision is desired. This leads to no
loss of information compared to the input and gives the appearance of more efficient computations.
However, this technique does not save any computations over regular fixed-point multiplication
and addition.
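For intuition, the following simplified sketch shows the power-of-two decomposition for the case of multi-bit activations against binary weights (DoReFa-Net applies the same idea to both operands, which is what gives the quadratic cost noted above; names are illustrative):

    import numpy as np

    def bitplane_dot(a_fixed, w_signs, n_bits):
        # a_fixed: unsigned fixed-point activations (integer array); w_signs: +1/-1 weights
        total = 0
        for i in range(n_bits):
            plane = (a_fixed >> i) & 1              # i-th bit plane of the activations
            total += (1 << i) * np.dot(plane, w_signs)
        return total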
Another technique is to compute the residual error between a scaled binarization and its
input, then compute another scaled binarization based on that error. This type of binarization is
known as residual binarization. Tang et al. [60] and ReBNet [62] both use residual binarizations (which
should not be confused with residual layers in DNNs). Compared to DoReFa-Net, the gain term
for each base is dependent on the magnitude of input value or residual. Information is lost, but
the first bases computed hold more information. This is a more efficient use of the bases than the
straightforward fixed-point base-two method of DoReFa-Net. Floating-point gain values can be used,
which makes this scheme better suited for GPUs and CPUs that are optimized for floating-point
computations. However, more computations are needed in order to calculate the
residuals of the binarization, a similar problem to XNOR-Net, but on a smaller scale since this is
being done for a handful of bases instead of every “pixel” in a feature map.
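To make the residual idea concrete, the following numpy sketch (our own illustration, not code from [60] or [62]) approximates an input with a few scaled binary bases, taking each gain from the mean magnitude of the remaining residual:

```python
import numpy as np

def residual_binarize(x, num_bases=3):
    """Approximate x as a sum of scaled binary bases: x ~ sum_i gamma_i * b_i."""
    residual = x.astype(np.float64).copy()
    gains, bases = [], []
    for _ in range(num_bases):
        gamma = np.mean(np.abs(residual))       # gain from the magnitude of the residual
        b = np.where(residual >= 0, 1.0, -1.0)  # binary base for the current residual
        gains.append(gamma)
        bases.append(b)
        residual -= gamma * b                   # the next base binarizes what is left over
    return gains, bases

x = np.random.randn(4, 4)
gains, bases = residual_binarize(x)
approx = sum(g * b for g, b in zip(gains, bases))
print(np.abs(x - approx).mean())                # error shrinks as bases are added
```

The first bases carry the most information, which is the property that makes this scheme a more efficient use of bases than the fixed-point decomposition of DoReFa-Net.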
Using information from activations in order to compute multiple bases requires more com-
putations, as seen in [60, 62]. ABC-Net [52] simply learns various bias terms for thresholding and
different gain terms for scaling bases when computing activations. By allowing the network to
learn these values instead of computing them directly, no extra computations are required at
inference time. ABC-Net still uses magnitude information from the weights during training, but
since weights are set after training, constant values are used during inference.
3.6.3 Partial Binarization
Instead of binarizing the whole network, a couple of methods have been proposed to bi-
narize only the parts of the network that benefit the most from the compression and keep the most
essential layers at full precision. The original BNN method and most other BNNs do in fact use
partial binarization since the last layer is kept at a higher precision to achieve the results that they
do. Tang et al. [60] propose a method for overcoming this (see Section 3.5.4).
Other networks have embraced this idea, sacrificing efficiency and model compression for
better accuracy by increasing the number of full precision layers. TaiJiNet [60] divides the kernels
of a convolutional layer into two groups: more important kernels that will not be binarized and
another group of kernels that will be binarized. To determine which kernels are more important,
TaiJiNet looks at combinations of statistics using L1 and L2-norms, mean, standard deviation and
variance.
Prabhu et al. [63] also explored partial binarization. Instead of separating out individual
kernels in a convolutional layer, each layer is analyzed as a whole. Every layer in the network is
given a score, then clustering is done to find an appropriate threshold that will split low scores from
high scores, deciding which layers will be binarized and which will not.
Wang et al. [64] use partial binarization during the training for better accuracy. The network
is gradually binarized as the network is trained. The method goes against the original method of
Courbariaux et al. [28] where binarization during training was desired. However, Wang et al. argue
that introducing binarization gradually during training helps achieve better accuracy.
Partial binarization is well suited for software-based systems where control is not dictated
by a hardware layout. Full binarization may not take full advantage of the available resources
of a system, while a full-precision network may prove too demanding to implement. Partial binarization can
be customized to a system but requires extra effort in selecting what parts to binarize. Partial
binarization decisions would need to be application and system specific.
3.6.4 Learning Rate
Experience has shown that BNNs take longer to train than full precision networks.
While traditional DNNs can make small adjustments to their weights during optimization, BNNs
update real-valued weights that may or may not produce any change in the output of the loss function.
These real-valued weights can be thought of as quasi accumulators that hold a running total of the
gradient for each binary weight. It takes an accumulation of gradient steps to change the sign of
the real-valued weight and thus change the binary weight.
In addition, most of the weights in BNNs converge to either strongly positive or strongly negative
values [61]. These binary weights do not change even when backpropagation dictates further steps in that
same direction. Many of the gradients that are calculated never have any effect on the loss and
never improve the network. For these reasons Tang et al. suggest that a higher learning rate should
be used to speed up training [60].
3.6.5 Padding
In DNNs, convolutions are often padded with zeros to help make the topology and data
flow more uniform. This is standard practice for convolutional layers. With BNNs however, using
zero padding adds a third value along with−1 and +1. Since there is no binary encoding for 0, the
bitwise operations are not compatible. We found that this is overlooked in much of the available
software source code provided by authors. Several works focusing on digital design and FPGAs
( [65–67]) point out this problem. Zhao et al. [65] experiment with the effects of just using +1
for padding. Fraser et al. [67] use −1 for padding and report that it works just as well as zero
padding. Guo et al. [66] explore this problem in detail and claim that simple padding with either
−1 or +1 hurts the accuracy of the BNN. Instead, they test alternating padding where the border
is padded with −1s and +1s at every other location. This method improves accuracy, but only
slightly. To help even further, they alternate which value the padding begins with from one channel
to the next. At every location where a −1 is used for padding in odd-numbered channels, a +1 is used in
even-numbered channels, and vice versa. This helps the BNN achieve accuracy similar to
a zero-padded network.
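One way to realize the alternating pattern described by Guo et al. is sketched below in numpy; the exact indexing used in [66] may differ, so this is only an illustration of the idea:

```python
import numpy as np

def alternating_pad(x):
    """Pad a (C, H, W) binary feature map with +/-1 in a checkerboard pattern,
    flipping the starting sign from one channel to the next."""
    c, h, w = x.shape
    out = np.zeros((c, h + 2, w + 2), dtype=x.dtype)
    rows = np.arange(h + 2)[:, None]
    cols = np.arange(w + 2)[None, :]
    checker = np.where((rows + cols) % 2 == 0, 1, -1)   # alternating +1/-1 border pattern
    for ch in range(c):
        sign = 1 if ch % 2 == 0 else -1                  # even/odd channels start from opposite values
        out[ch] = sign * checker
        out[ch, 1:-1, 1:-1] = x[ch]                      # interior keeps the original activations
    return out
```

Because the padding values are still only −1 and +1, the padded feature maps remain compatible with the XNOR-based dot products.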
3.6.6 More Binarization
The original BNN methodology is not completely binarized. As mentioned in Section
3.6.5, the convolutional padding scheme used by the original open-source BNN software imple-
mentation uses zero-padding which introduces 0 values into the network. This turns the network
into a ternary network instead of a binary network. Some hardware implementations get around
this by not using padding at all. The methods mentioned in Section 3.6.5 allow for complete use
of bitwise operations and padding, leading to faster run times than with networks that involve 0
values.
Other parts of most BNN models that are not completely binary are the first and last layers.
The FBNA [66] methodology focuses on making the BNN topology completely binary. This
includes binarizing the first layer. Instead of using the fixed precision values from the input data,
they use a scheme similar to DoReFa-Net [38] where a smaller bit width is used for the values and
the small bit-width values are split into multiple binarizations without losing precision. Instead of
using a base for each power of two used to represent the data, they use as many bases as needed to
be able to sum binary values to get the original value. This seems to be a less efficient technique
since n² bases are needed for n bits of precision.
Tang et al. [60] introduce a method for binarizing the final FC layer of a network, which is
traditionally left at higher precision. They use a learnable channel-wise gain term within the dot
product to allow for manageable numbers instead of large whole values. Details are provided in
Section 3.5.4.
3.6.7 Batch Normalization and Activations as a Threshold
The costly calculation of the BN layer may seem to contradict the efficiency of the BNN
methodology. However, implementing a BN layer in the forward pass is arithmetically equivalent
to an integer threshold in BNNs. The BN layer computes
$$\mathrm{BN}(x) = \frac{x-\mu}{\sqrt{\sigma^2+\epsilon}}\,\gamma + \beta \qquad (3.9)$$
where x is the input, µ is the mean value of the batch, σ2 is the variance of the batch, ε is added
for numerical stability and γ and β are learned gain and bias terms. Since the activations simply
calculate sign(BN(x)),

$$\mathrm{sign}(\mathrm{BN}(x)) = \begin{cases} +1 & x \ge \tau \\ -1 & x < \tau \end{cases} \qquad (3.10)$$

where

$$\tau = -\frac{\beta\sqrt{\sigma^2+\epsilon}}{\gamma} + \mu \qquad (3.11)$$
This is only true if γ is positive. For negatively valued γ , the same comparator would be
used, but x would be negated. Since integer values are produced by the dot product in a BNN, τ
can be rounded appropriately to an integer.
This method is very useful in digital designs where thresholding is much simpler than the
explicit arithmetic required for BNN layers during training.
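Folding a trained BN layer and the sign activation into a single integer threshold follows directly from Equations 3.9–3.11. A small sketch, assuming γ > 0 and integer-valued dot-product outputs as discussed above:

```python
import numpy as np

def bn_sign_threshold(mu, var, gamma, beta, eps=1e-5):
    """Fold BN + sign into a single integer threshold tau (valid for gamma > 0)."""
    tau = -beta * np.sqrt(var + eps) / gamma + mu
    return int(np.ceil(tau))                 # for integer x, sign(BN(x)) == +1 exactly when x >= ceil(tau)

def binarize(x, tau):
    return np.where(x >= tau, 1, -1)

# x stands in for the integer outputs of a binary dot product
x = np.array([-3, 0, 2, 7])
print(binarize(x, bn_sign_threshold(mu=1.0, var=4.0, gamma=0.5, beta=0.25)))   # [-1  1  1  1]
```

For negative γ, the comparison is simply flipped, as noted above.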
3.6.8 Layer Order
As pointed out in Section 3.5.2, BNNs can be better trained if the pooling layer is placed
before the BN/activation layer. However, in the forward pass, there is no difference in the order of
these particular layers. Umuroglu et al. [68] point out that execution is faster during inference if
the pooling layer comes after the activation. That way binary values can be used, eliminating the
need for comparisons in the max-pooling layer. If any of the inputs is +1, the output is +1. An
OR function can be used on the binary encodings within the network to achieve max pooling.
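With +1 encoded as 1 and −1 as 0, 2×2 max pooling over binarized activations reduces to a logical OR over each window, as the following numpy sketch illustrates (it assumes even spatial dimensions):

```python
import numpy as np

def binary_maxpool2x2(bits):
    """2x2 max pooling on an (H, W) array of {0, 1} encodings (+1 -> 1, -1 -> 0).
    The max over a window is 1 if any element is 1, so pooling becomes a logical OR."""
    h, w = bits.shape
    return (bits[0:h:2, 0:w:2] | bits[1:h:2, 0:w:2] |
            bits[0:h:2, 1:w:2] | bits[1:h:2, 1:w:2])

acts = np.random.randint(0, 2, (4, 4))
print(binary_maxpool2x2(acts))
```

In hardware, the same OR can be applied directly to packed words of activations, which is why placing pooling after the activation is cheaper.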
3.7 Comparison of Accuracy
In this section, we present a comparison of the accuracy of BNNs on a few different
datasets. Each accuracy is associated with a specific publication, and we only include the authors’
self-reported accuracies, not accuracies other authors reported as a comparison.
3.7.1 Datasets
Four major datasets are used to test BNNs. We include results for these four datasets. Various
publications exist for specialized applications of BNNs on specific datasets [18, 45, 69–76].
MNIST: A simple dataset of handwritten digits. The images are only 28× 28 pixel
grayscale images with 10 classes to classify. This data set is fairly easy, and FC layers are used
to classify these images instead of CNNs. Courbariaux et al. do this in their original work on
BNNs [28], claiming that it is harder with FC layers and is a better test of the BNN’s capabilities. The
dataset contains 60,000 images for training and 10,000 for testing.
SVHN: The Street View House Numbers dataset. A dataset of photos of single digits (0–9)
taken from photos of house numbers. These color photos are centered on a single digit. The dataset
contains just under 100,000 32×32 images classified into 10 classes.
CIFAR-10: A dataset of 60,000 32×32 photos. Contains 10 different classes: 6 different
animals and 4 different vehicles.
ImageNet: Larger photographs of varying sizes. These images are usually resized to a
common size before processing. Contains images from 1000 different classes. ImageNet is a
much larger and more difficult dataset than the other three datasets mentioned. The most common
version of this dataset, from 2012, contains 1.2 million images for training and 150,000 images for
validation.
3.7.2 Topologies
While BNN methods can be applied to any topology, many BNNs compared in the literature
are binarizations of common topologies. We list the topology of networks used as we compare
methods. Papers that did not specify which topology was used are denoted with NK in the topology
column throughout this chapter. For topologies that were ambiguous, we provide as much detail
as was provided by the authors.
All the topologies compared in this section are feedforward topologies. They are either
described by their name if they are well established in the literature (like AlexNet or ResNet) or
we describe them layer by layer with our own notation.
Our notation is similar to others used in the literature. Layers are specified in order. Three
types of layers can be specified: Convolutional layers, C; fully connected layers, FC; and max-
pooling layers, MP. BN layers and activations are implied and are not listed. The number of
output channels of a layer is listed directly after the type of layer. For example, FC1024 is a fully
connected layer with 1024 output channels. The number of input channels can be determined by
the output of the last layer or the size of the input image or the number of activations produced by
the last max-pooling layer. All max-pooling layers have a receptive field of 2×2 pixels and a stride
of 2. Duplicates of layers also occur often. We list the multiplicity of layers before the type. Two
convolutional layers with 256 output channels could be listed as C256-C256, but we use 2C256
instead.
To better understand and compare the accuracies in this section, we provide a description
of the common topologies used by BNN that are not well known otherwise. We refer to these
common topologies in the comparison tables in this section.
We will refer to the topology of the convolutional BNN proposed by Courbariaux et al. [28]
and used on the SVHN and CIFAR-10 datasets as BNN. It is a variation of a VGG-11 topology
with the following structure: 2C128-MP-2C256-MP-2C512-MP-2FC1024-FC10 as seen in Figure
3.2. Other networks use this same topology but reduce the number of channels by half. We denote
these as 1/2 BNN.
Figure 3.2: Topology of the original Binarized Neural Networks (BNN). Numbers listed denote the number of output channels for the layer. Input channels are determined by the number of channels in the input, usually 3, and the input size for the FC layers.
We will refer to a common three-layer MLP with the following structure as MLP:
3FC1024-FC10. 4xMLP will denote an MLP with 4 times as many hidden channels (3FC4096-
FC10).
Some works mention the DoReFa-Net topology. The original DoReFa-Net paper does not
outline any specific topology but instead outlines a general methodology [34]. We suspect that pa-
pers that claim to use the DoReFa-Net topology use a software implementation of DoReFa-Net like
the one included in Tensorpack for Tensorflow, which may be a binarization of a popular topology
like AlexNet. However, since we do not know for sure, we denote these entries as DoReFa-Net.
3.7.3 Table of Comparisons
Seven tables are included in this section to report the accuracies of different BNN method-
ologies for the MNIST (Tables 3.5 and 3.6), SVHN (Tables 3.7 and 3.8), CIFAR (Tables 3.9 and
3.10) and ImageNet (Tables 3.11) datasets. We also report the accuracies of non-binary networks
that are related, like partial binarized networks and BinaryConnect, which preceded BNNs.
Accuracies on MNIST
Table 3.5: BNN accuracies on the MNIST dataset. The accuracy reported for [77] was not explicitly stated by the authors. This number was inferred from the figure provided.
Source Accuracy(%) Topology
[78] 95.7 FC200-3FC100-FC10
[28] 96.0 MLP
[77] 97 NK
[79] 97.0 LeNet
[80] 97.69 MLP
[81] 97.86 ConvPool-2
[62] 98.25 1/4 MLP
[68] 98.4 MLP
[82] 98.40 MLP
[83] 98.6 NK
[84] 98.67 MLP
[85] 98.77 FC784-3FC512-FC10
Table 3.6: Accuracies on the MNIST dataset of non-binary networks related to works reviewed.
Source Accuracy(%) Topology Precision
[41] 95.15 NK Ternary values
[86] 96.9 NK 8-bit values
[86] 97.2 NK 12-bit values
[84] 98.53 MLP 2-bit values
[46] 98.71 BinaryConnect deterministic 32-bit float activations
[80] 98.74 MLP 32-bit float
[46] 98.82 BinaryConnect stochastic 32-bit float activations
[39] 99.1 NK Ternary values
Accuracies on SVHN
Table 3.7: BNN accuracies on the SVHN dataset.
Source Accuracy(%) Topology

[80] 94.9 1/2 BNN
[68] 94.9 1/2 BNN
[66] 96.9 NK
[62] 97.00 C64-MP-2C128-MP-2C256-2FC512-FC10
[38] 97.1 DoReFa-Net
[28] 97.47 1/2 BNN
Table 3.8: Accuracies on the SVHN dataset of non-binary networks related to works reviewed.
Source Accuracy(%) Topology Precision
[42] 97.60 1/2 BNN Ternary weights
[42] 97.70 BNN Ternary weights
[46] 97.70 BinaryConnect - deterministic 32-bit float activations
[46] 97.85 BinaryConnect - stochastic 32-bit float activations
Accuracies on CIFAR-10

Table 3.9: BNN accuracies on the CIFAR-10 dataset.

Source Accuracy(%) Topology Disambiguation

[87] 66.63 2 conv. and 2 FC
[67] 79.1 1/4 BNN
[68] 80.1 1/2 BNN
[80] 80.1 1/2 BNN
[40] 80.4 VGG16
[88] 81.8 VGG11
[83] 83.27 NK
[67] 88.3 BNN
[61] 83.52 DoReFa-Net R2 regularizer
[61] 83.92 DoReFa-Net R1 regularizer
[81] 84.3 NK
[67] 85.2 1/2 BNN
[89] 85.9 6 conv.
[79] 86.0 ResNet-18
[90] 86.05 9 256-ch conv.
[87] 86.06 5 conv. and 2 FC
[91] 86.78 NK
[62] 86.98 C64-MP-2C128-MP-2C256-2FC512-FC10
[77] 87 NK
[61] 87.16 AlexNet R1 regularizer
[61] 87.30 AlexNet R2 regularizer
[65] 87.73 BNN +1 padding
[59] 88 BNN 512 channels for FC
[65] 88.42 BNN 0 padding
[85] 88.47 6 conv.
[66] 88.61 NK
[28] 89.85 BNN

Table 3.10: Accuracies on the CIFAR-10 dataset of non-binary networks related to works reviewed.

Source Accuracy(%) Topology Precision

[40] 81.0 VGG16 Ternary values
[40] 82.9 VGG16 Ternary values
[42] 86.71 1/2 BNN Ternary values
[42] 89.39 BNN Ternary values
[46] 90.10 BinaryConnect - deterministic 32-bit float activations
[46] 91.73 BinaryConnect - stochastic 32-bit float activations

Accuracies on ImageNet

Table 3.11: BNN accuracies on the ImageNet dataset.

Source Top 1 Acc.(%) Top 5 Acc.(%) Topology Details

[48] 36.1 60.1 BNN AlexNet
[38] 40.1 AlexNet
[62] 41.43 Details in [62]
[57] 41.8 67.1 BNN AlexNet
[58] 44.2 69.2 AlexNet
[38] 43.6 AlexNet Pre-trained on full precision
[61] 45.62 70.13 AlexNet R2 reg
[61] 46.11 75.70 AlexNet R1 reg
[48] 47.1 69.1 BNN GoogleNet
[63] 48.2 71.9 AlexNet Partial binarization
[58] 51.2 73.2 ResNet18
[61] 52.64 72.98 ResNet-18 R1 reg
[61] 53.01 72.55 ResNet-18 R2 reg
[92] 54.8 77.7 ResNet-18 Partial binarization
[93] 55.8 78.7 AlexNet Partial binarization
[52] 65.0 85.9 ResNet-18 5 bases
[52] 68.4 88.2 ResNet-34 5 bases
[52] 70.1 89.7 ResNet-50 5 bases
[94] 75 VGG 16
[60] 75.6 51.4 AlexNet binarized last layer
3.8 Hardware Implementations
3.8.1 FPGA Implementations
FPGAs are a natural platform for BNNs when performing inference. BNNs take advan-
tage of bitwise operations when performing dot products. While CPUs and GPUs are capable
of performing these operations, they are optimized for a range of tasks, especially integer and
floating-point operations. FPGAs allow for custom data paths. They specifically allow for hard-
ware architectures optimized around the XNOR and popcount operations. FPGAs are generally
low-power platforms compared to CPUs, and especially GPUs. They also usually come in smaller
form factors than GPUs.
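The dot product that these designs build their data paths around uses only XNOR and popcount. Its arithmetic can be written in a few lines of Python; FPGA implementations operate on wide packed words, but the identity is the same:

```python
def binary_dot(a_bits, w_bits, n):
    """Dot product of two n-element {-1, +1} vectors packed as integers
    (bit 1 encodes +1, bit 0 encodes -1): result = 2 * popcount(XNOR) - n."""
    xnor = ~(a_bits ^ w_bits) & ((1 << n) - 1)   # matching positions contribute +1
    return 2 * bin(xnor).count("1") - n

# example: a = [+1, -1, +1, +1] -> 0b1011, w = [+1, +1, -1, +1] -> 0b1101
print(binary_dot(0b1011, 0b1101, 4))             # (+1) + (-1) + (-1) + (+1) = 0
```

On an FPGA, the XNOR is a single LUT-level operation per bit and the popcount is an adder tree, which is why these dot products map so naturally to the fabric.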
3.8.2 Architectures
FPGA DNN architectures usually fall under one of two categories, streaming architectures
and layer accelerators. Streaming architectures have dedicated hardware for all or most of the layers
in a network. These types of architectures can be pipelined, where each stage in the architecture
can be processing different input samples. This usually offers higher throughput, reasonable la-
tency, and requires less memory bandwidth. They do require more resources since all layers of the
network need dedicated hardware. These types of architectures are especially well suited for video
processing. This style is the most commonly found throughout the literature.
Layer accelerators provide modules that can handle only a specific layer of a network.
These modules need to be able to handle every type, size, and channel width of input that may be
required of it. Results are stored in memory to be fed back into the accelerator for the next layer
that will be processed. These types of architectures do not require as many resources as streaming
architectures but have much lower throughput. These types of architectures are well suited for
constrained resource designs where high throughput is not needed. The feedback nature of layer
processors also makes them well suited for RNNs, as seen in [95].
FPGAs typically include digital signal processors (DSPs) and block memories (BRAMs)
built into the logic fabric. DSPs can be vital in full precision DNN implementations on FPGAs
where they can be used to compute multi-bit dot products. In BNNs however, dot products are
bitwise operations and DSPs are not used as extensively. Nakahara et al. [96] show the effective-
ness of in-fabric calculation in BNNs over methods that use DSPs. BRAMs are used in BNNs to
store activations, weights, and other parameters. They offer storage for sliding windows used in
convolutions.
CPU-FPGA hybrid systems combine a CPU and an FPGA on the same silicon. These
systems are widely used to implement DNNs and BNNs [62,65,66,68,80,87–89,94,97]. The CPU
is flexible and easily programmed to load inputs to the DNN. The FPGA can then be programmed
to execute the BNN architecture without extra logic for input and output processing.
3.8.3 High-Level Synthesis
FPGAs can be difficult to program for those who do not have specialized training. To help
software programmers without experience with hardware design, tool flows have been designed
where programmers can write a program in a language like C++ which is then synthesized into an
FPGA hardware design. This type of workflow is called High-Level Synthesis (HLS). HLS has
been a major component of the research done with BNNs on FPGAs [62, 65–67, 80, 82, 89, 94, 95,
98].
Yaman Umuroglu et al., from the Xilinx Research Labs, provided a specific workflow
designed for training and implementing BNNs called FINN [68]. Training of a BNN is done with
a deep learning library. The trained model is then used by FINN to produce code for the BNN
which it synthesizes into a hardware design by Xilinx’s HLS tool. The FINN tool received an
extension allowing it to work with BNN topologies for LSTMs [95]. Xilinx Research labs also
extended the capabilities of FINN by allowing it to work with multi-bit quantized networks, not
just with BNNs [80].
3.8.4 Comparison of FPGA Implementations
We provide a comparison of BNN implementations in FPGA platforms in Table 3.12.
Details regarding accuracies, topologies, FPGA usage, and FPGA execution are given. Note
that [62, 89] were the only works that reported significant DSP usage, so DSP usage was left
out of Table 3.12.
3.8.5 ASICs
While FPGAs are well suited for processing BNNs and take advantage of their efficient
bitwise operations, custom silicon designs, or ASICs, have the potential to provide the ultimate
power and computational efficiency for any hardware design. FPGA fabrics can be configured
for BNN topologies, but the physical layout of FPGAs never changes. The fabric and resources
are made to fit a wide variety of applications. Hardware layout in ASIC designs can be changed
to fit the specifications for BNNs. Bitwise operations can be even more efficient in ASIC designs
than they are in any other platform [70, 71, 83, 90, 99–101]. ASIC designs can integrate image
Table 3.12: Comparison of FPGA implementations. The accuracies reported from [94, 96] were not explicitly stated. These numbers were inferred from the figures provided. The accuracy for [94] is assumed to be a top-5 accuracy and the accuracy for [62] is assumed to be a top-1 accuracy, but this was never stated by their respective authors. Datasets: MNIST = MN, SVHN = SV, CIFAR-10 = CI, ImageNet = IN. Columns: Source, Dataset, Acc. (%), Topology, FPGA, LUTs, BRAMs, Clk (MHz), FPS, Power (W).
sensors [102] and other peripheral elements into their design for fast processing and low latency
access.
Nurvitadhi et al., from Intel’s Accelerator Architecture Lab, designed a layer accelerator for
BNNs in an ASIC [99]. They compare the execution performance of the ASIC implementations
with implementations in an FPGA, CPU, and GPU. They show that power can be significantly
lower in an ASIC while maintaining the fastest execution times.
Since BNNs, like most DNNs, require a large number of parameters, a handful of ASIC
designs focus on in-memory computations [94,103–106]. Custom silicon also allows for the use of
mixed technologies and memory-based designs in resistive RAM (RRAM). RRAM is an emerging
technology and is an appealing platform for BNN designs due to its compact operation at the bit
level [85, 86, 103, 107, 108].
3.9 Conclusions
BNNs can provide substantial model compression and inference speedups over traditional
DNNs. BNNs do not achieve the same accuracy as their full precision counterparts, but improve-
ments have been made to close this gap. BNNs appear to be better replacements for smaller
networks rather than larger ones, coming within 4.3% top-1 accuracy of full precision for the small
ResNet-18 but only within 6.0% for the larger ResNet-50.
The use of multiple bases, learned gain terms, learned bias terms, intelligent padding, and
even partial binarization have helped to make BNNs accurate while still executing at high speeds
and maintaining small model sizes. These speeds have been accelerated even further as BNNs have been
implemented in FPGAs and ASICs. New tool flows like FINN have made programming BNNs on
FPGAs accessible to more designers.
CHAPTER 4. USING FULL PRECISION METHODS TO SCALE DOWN BINARIZEDNEURAL NETWORKS
Most of the developments made to BNNs focus on scaling up BNNs to handle large and
complex image classification datasets. However, even the smallest BNNs proposed in the literature
do not fit on stand-alone mid-sized FPGAs [67]. Many proposed improvements to BNNs move
away from hardware- and FPGA-friendly operations [58].
This chapter and Chapter 5 explore techniques for scaling down BNNs in order to im-
plement them on resource-limited systems like embedded computers and midsized FPGAs. This
chapter explores methods that already exist for full-precision networks, which we apply to BNNs.
Chapter 5 introduces a novel method specifically designed for BNNs. We find that the methods
explored in this chapter are not particularly effective for scaling down BNNs. We hypothesize that
it is necessary to use methods specifically designed for binary values in order to effectively scale
down BNNs, which motivates our work in Chapter 5.
4.1 Depthwise Separable Convolutions
Depthwise separable convolutional layers were first introduced in [29]. They reduce the
number of required operations by convolving each of the input channels of the convolutional layers
with 2D kernels and then combining them with 1×1 point-wise filters.
Standard convolutional layers use a 3D filter of size K×K×M, where K is the width and
height of the kernel and M is the number of input channels, as shown in Figure 4.1. N number
of these filters are used to produce N output channels. Depthwise separable convolutional layers
use M number of K×K 2D kernels. Each of these kernels is applied to a single input channel. N
output channels are formed by combining these convolutions with N 1×1×M pointwise filters,
shown in Figure 4.2.
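The savings in operations can be read off from the two figures. A small back-of-the-envelope sketch (our own arithmetic, counting operations per output position):

```python
def standard_conv_ops(K, M, N):
    """Operations per output position for N standard filters of size K x K x M."""
    return N * K * K * M

def depthwise_separable_ops(K, M, N):
    """M depthwise K x K kernels plus N pointwise 1 x 1 x M filters."""
    return M * K * K + N * M

K, M, N = 3, 128, 128
print(standard_conv_ops(K, M, N))        # 147456
print(depthwise_separable_ops(K, M, N))  # 17536
```

For typical layer widths, the separable form needs roughly an order of magnitude fewer operations, which is what motivated trying it inside a BNN.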
Figure 4.1: Standard convolutional layers use N number of K×K×M filters, where M is the number of input channels and N is the number of output channels.
Figure 4.2: Depthwise separable convolutional layers use M number of K×K 2D filters, one for each input channel, and combine them with N number of pointwise filters with a depth of M.
4.1.1 Experiments and Results
We tested models that use binarized depthwise separable convolutional layers on the CIFAR-
10 dataset. We constructed models using a VGG style topology, which is common through-
out the literature. The models had the following structure: C128-BN-C128-MP-BN-C256-BN-
C256-MP-BN-C512-BN-C512-MP-BN-FC1024-BN-FC1024-BN-FC10-SM. C represents a con-
volutional layer followed by the number of filters, BN represents a Batch Norm layer, MP repre-
sents 2× 2 max-pooling layers with a stride of 2, FC represents fully connected layers followed
by the number of units, and SM represents the softmax function. The topology is shown in Figure
4.3. We tested the depthwise separable layers against standard BNN convolutional layers. We also
tested models with convolutional layers where only the pointwise convolutional kernel or 3× 3
kernels were binarized.
We saw a significant drop in accuracy when using depthwise separable binarized convolu-
tional layers compared to standard convolutional layers. Figure 4.4 shows our test results. Using
separable filters decreases the accuracy from 80% to 70% and using real values in either the point-
wise or depthwise parts of the separable filter does not offer an increase in accuracy.
4.1.2 Discussion
The degradation in accuracy we see when using the depthwise separable filters is too much
to recommend the use of depthwise separable filters in BNNs. Yang et al. performed similar
experiments for object tracking tasks but did not show any significant advantages when using
depthwise separable BNNs [109]. Our experiments showed that is it essential to use real values in
either the pointwise or depthwise filters. Using standard fully binarized BNN convoultional layers
offers just as much accuracy as only binarizing ether the depthwise or pointwise components of
the separable filters and uses only binary values. We do not recommend using binarized depthwise
separable filters.
4.2 Direct Skip Connections in BNNs
Direct skip connections were first introduced by Huang et al. as part of the DenseNet
model [14]. These types of skip connections are different from the residual skip connections
Figure 4.3: The topology used to test BNN models with depthwise separable filters. The same topology was used for both the models with standard convolutional layers and depthwise separable layers.
Figure 4.4: The classification accuracy of depthwise separable filters on the CIFAR-10 dataset. Standard BNNs were tested for comparison. We also tested separable filters where only the depthwise or pointwise parts of the separable filter were binarized. The accuracy is significantly decreased when using separable filters, and using real values in either separable component does not increase accuracy over standard BNNs.
introduced in the ResNet model [110]. Instead of adding a previous layer’s outputs to a downstream
layer’s inputs, dense skip connections concatenate a previous layer’s outputs as extra channels to
a downstream layer’s inputs. In addition, direct skip connections connect every layer to all layers
that follow, as shown in Figure 4.5. This allows for gradients to pass freely from every output
layer to every other previous layer. This can help boost the strength of gradient signals to all layers
during backpropagation, which is a weakness of BNNs.
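A minimal PyTorch-style sketch of the concatenating connections described above; an ordinary full-precision convolution stands in for the binarized layer, and the sign activation is shown for the forward pass only (the training-time straight-through estimator is omitted):

```python
import torch
import torch.nn as nn

class DenseSkipBlock(nn.Module):
    """Each layer receives the concatenation of all previous layers' outputs."""
    def __init__(self, in_channels, growth, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(channels, growth, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(growth),
            ))
            channels += growth                 # inputs grow as outputs are concatenated

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = torch.sign(layer(torch.cat(features, dim=1)))  # binarized activation (forward only)
            features.append(out)
        return torch.cat(features, dim=1)

block = DenseSkipBlock(in_channels=3, growth=32, num_layers=3)
print(block(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 99, 32, 32])
```

The stored intermediate feature maps are the source of the memory cost discussed in Section 4.2.2.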
4.2.1 Experiments and Results
We constructed a BNN model that used direct skip connections. We also built a standard
BNN model with the same number of weights as the proposed BNN model. Figure 4.5 illustrates
the proposed connections in contrast to those of standard BNNs.
These connections did not seem to improve large BNNs much. However, they appeared
to effectively allow for channel reduction and downscaling. Figure 4.6 compares classification
test accuracy on the CIFAR-10 data set using standard BNN topology compared to our proposed
method using direct skip connections. As the number of base channels in the convolutional layers
was reduced, models with skip connections performed better than models without.
We hypothesize that the skip connections compensate for the limited gradient flow imposed by the
binarization of the network and the reduced number of channels. With direct skip connections,
the gradient information can flow to all layers of the network, passing through fewer binarization
operations.
Figure 4.5: An illustration of standard feedforward connections (left) and direct skip connections (right). Direct skip connections concatenate the outputs of previous convolutions to form the input to the next layer of convolution. Including these connections helps BNNs with a small number of weights.
Figure 4.6: Plots of the test accuracy during training on the CIFAR-10 dataset. Models with skip connections were compared to models without. Each plot is labeled with the number of base convolution channels. As the number of convolution channels decreases, models with skip connections perform better than the models without skip connections.
4.2.2 Discussion
Our experiments with dense connections in BNNs seemed to show us that the number of
weights and operations in BNNs can be reduced without sacrificing much accuracy. It seemed that
when direct skip connections were present, reducing the number of parameters does not reduce the
accuracy of the network as much as in BNNs without direct skip connections. However, there is a
drawback to using these types of connections. The activations from early layers in the network need
to be stored in order to be reused as inputs to every other layer in the network. This increases the
memory requirements for the network. Dense skip connections can be used to increase accuracy
on small BNNs, but since they come with an increased memory cost, we do not propose dense skip
connections as an effective means for implementing BNNs on smaller FPGAs.
CHAPTER 5. NEURAL JET FEATURES
Due to the bitwise nature of BNNs, there have been many efforts to implement BNNs on
ASICs and FPGAs [62,65,66], as reviewed in Chapter 3. While BNNs are excellent candidates for
these kinds of resource-limited systems, most implementations still require very large FPGAs or
CPU-FPGA co-processing systems [80, 94]. This chapter focuses on reducing the computational
cost of BNNs even further, making them more efficient to implement on resource-limited systems.
We target embedded visual inspection tasks, like defect detection on manufactured parts and the
sorting of agricultural produce. We propose a new binarized convolutional layer, called the Neural
Jet Features layer. This layer uses deep learning to learn essential classic computer vision kernels
that are efficient to calculate as a group. We show that on visual inspection tasks, Neural Jet
Features perform comparably to standard BNN convolutional layers while using fewer computational
resources. We also show that Neural Jet Features tend to be more stable than BNN convolutional
layers when training small models.
5.1 Introduction
In Chapter 2 we presented Jet Features [111], a set of convolution kernels that are efficient
to compute on resource-limited systems, which utilize the key scaling and partial derivative prop-
erties of classic image processing kernels. We demonstrated their effectiveness by replacing the
set of traditional image features used in the ECO Features algorithm [19] with Jet Features, which
allowed the algorithm to be effectively implemented on an FPGA for high-speed, low-power clas-
sification without sacrificing image classification accuracy.
In this chapter, we propose a new binarized convolutional layer called the Neural Jet Fea-
tures layer. The Neural Jet Features layer is a convolutional layer that is trained to form Jet Fea-
tures within a BNN using DL methods. This creates a BNN model that is less costly to compute
on resource-limited systems while maintaining comparable classification accuracy. We also show
that Neural Jet Features are more stable during training on the MNIST dataset.
5.2 Neural Jet Features
Neural Jet Features are Jet Features that are learned through DL and used in place of stan-
dard binarized convolutional filters. Neural Jet Features require fewer parameters and fewer opera-
tions than the traditional convolutional layers used in BNNs. They learn essential classic computer
vision kernels, combined through deep learning. It is not possible for standard BNN convolutional
layers to learn these classic computer vision features. The results in Section 5.3 show that BNNs
using Neural Jet Features achieve comparable accuracy on certain image classification tasks com-
pared to BNNs using standard binarized convolutional layers. Convolutions typically account for the majority
of operations in a BNN, and by using fewer operations to perform convolution, Neural Jet Features
allow BNNs to fit into resource-limited systems, like embedded computers and FPGAs.
The small 2× 1 kernels that make up a Jet Feature, as shown in Figure 5.1, differ only
in orientation and whether their second element contains a -1 or 1. A 3× 3 Jet Kernel is formed
from four of these smaller kernels, thus four binary weights need to be selected to form a 3×3 Jet
Feature.
Figure 5.1: The 2×1 kernels that are used to form Jet Features.
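How the 2×1 building blocks compose into familiar 3×3 kernels can be verified with a few lines of numpy. This is our own illustration of the separable composition described in Chapter 2; the kernel names are chosen for readability and are not part of the original notation:

```python
import numpy as np

blocks = {"smooth": np.array([1, 1]), "diff": np.array([1, -1])}

def jet_1d(a, b):
    """Compose two 2x1 building blocks into a length-3 1D kernel by convolution."""
    return np.convolve(blocks[a], blocks[b])

def jet_3x3(vert, horiz):
    """A 3x3 Jet kernel is the outer product of a vertical and a horizontal 1D kernel."""
    return np.outer(jet_1d(*vert), jet_1d(*horiz))

print(jet_3x3(("smooth", "smooth"), ("smooth", "smooth")))  # Gaussian-like smoothing kernel
print(jet_3x3(("smooth", "smooth"), ("smooth", "diff")))    # Sobel-like edge kernel
```

Choosing the second element of each 2×1 block as +1 or −1 is exactly the four-binary-weight selection described above.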
5.2.1 Constrained Neural Jet Features
In Chapter 2, we observed that the genetic algorithm selected scaling factors ([1,1]) more
frequently than partial derivatives ([1,−1]) when forming Jet Features. Based on this observation,
we experimented with a constrained version of Neural Jet Features where one of the vertical pa-
rameters and one of the horizontal parameters was forced to be 1, forming scaling factors more
often than partial derivatives. This reduced the computational cost of Neural Jet Features (Section
5.2.3) while increasing the average accuracy in some of our testing compared to unconstrained
Neural Jet Features (Section 5.3). With only two binary parameters to be learned per kernel, there
were only four possible kernels: the Gaussian kernel, vertical and horizontal Sobel kernels, and a
diagonally oriented edge detector, as shown in Figure 5.2. The constrained version of Neural Jet
Features was more efficient than the unconstrained version with comparable or better accuracy.
The constrained form of Neural Jet Features is the proposed form of Neural Jet Features.
5.2.2 Jet Features
5.2.3 Computational Efficiency
Neural Jet Features learn how to combine input channels using these four classical com-
puter vision kernels, shown in Figure 5.2. These kernels have been essential to traditional computer
vision and are often used multiple times within a single algorithm [2, 3]. Since there are only four
possible kernels to be learned, it may make more sense to view a Neural Jet Feature layer as a layer
that learns to combine these four features with the four features of all the other input channels. Even
though there are only four possible features to learn for each input channel, there are N⁴ unique
ways in which to combine the features of the input channels to form a single output channel, where
N is the number of input channels.
Like traditional convolutional layers, Neural Jet Features reuse weights in ways that em-
ulate traditional computer vision and image processing. Instead of learning separate weights for
every combination of input and output pixel, as done in fully connected layers, convolutional layers
form weights into kernels that are applied locally and reused throughout the entirety of the input
image, greatly reducing the number of weights to be learned. Similarly, Neural Jet Features also
reuse weights. Neural Jet Features do not learn unique kernels for each and every combination of
Figure 5.2: The classic computer vision kernels that can be constructed with “constrained” Neural Jet Features (see Section 5.2.1).
input and output channels. Instead, all four 3× 3 Jet Features are applied to each input channel
and then reused as needed to form the output channels. This reduces the number of operations
required, especially when there are more than just a few output channels.
All four Jet Features are made up of similar 2× 1 convolutions, as shown in Figure 2.3.
Since all four Jet Features are always calculated for every input channel (and reused as needed),
these convolutions can be effectively calculated as a group. The smaller 2× 1 convolutions that
are common to multiple features can be calculated first and reused to calculate the larger 3×3 Jet
Features. Figure 5.3 shows how these 2×1 kernels can be applied in an effective manner to reduce
the number of operations that are needed. By contrast, four arbitrary 3× 3 binary convolutions
are not guaranteed to share any common operations, thus they must be calculated independently of
each other.
Both of these aspects, kernel reuse and common 2×1 convolutions, make Neural Jet Fea-
tures computationally efficient compared to standard binary convolutions. These bitwise efficien-
cies are particularly well suited for FPGA and ASIC implementations where data paths are de-
signed at a bitwise level of granularity.
We illustrated the computational efficiency of Neural Jet Features with a potential FPGA
hardware design, shown in Figure 5.3. The top diagram shows a typical multiply-accumulate
operation for an input channel in a BNN. The multiplication operations are simply bitwise XNOR
operations. The addition operations are organized into a tree structure to reduce the resources
needed. One accumulation tree is required for every output channel. By contrast, the number of
accumulation operations in a Neural Jet Feature layer does not increase with added output channels.
All features are calculated and reused as needed to form the output channels. The addition and
subtraction units in the bottom two diagrams are the same units shown in Figure 2.10 in Chapter 2.
To form a rough comparison of the computational cost of each of the options shown in Fig-
ure 5.3, we assign a cost of 1, 2, 3, or 4 units to each of the operations depending on the number of
bits in their operands. We assign a cost of 2 to the final addition of the standard BNN convolutional
layer which has a 4-bit operand and a 1-bit operand. The standard BNN would cost 13 units per
output channel. For the unconstrained Neural Jet Features the cost would be 61 units to produce
all 9 features. The constrained version would only cost 27 units for all possible Jet Features. In
layers with 3 or more output channels, calculating all of the constrained Jet Features is less expen-
sive than standard BNN convolutions. For example, in a layer with 64 output channels, a standard
BNN layer would cost 832 units (64× 13) while a constrained Neural Jet Feature layer would
cost only 27 units, since the features would be reused as needed. The number of accumulation re-
sources does not scale with the number of output channels as it does in standard BNN convolutional
layers. We note that these comparisons are hypothetical and as part of our future work we plan to
implement standard BNN convolutional layers and Neural Jet Feature layers in an FPGA to more
accurately demonstrate their computational efficiency.
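The sharing described above can be sketched functionally in numpy. The feature names and the per-channel selections below are hypothetical placeholders for what the network would learn; only the reuse pattern is the point:

```python
import numpy as np

def conv1d_along(x, kernel, axis):
    """'Valid' 1D correlation of a 2D array along one axis."""
    x = np.moveaxis(x, axis, -1)
    n = x.shape[-1] - len(kernel) + 1
    out = sum(k * x[..., i:i + n] for i, k in enumerate(kernel))
    return np.moveaxis(out, -1, axis)

def constrained_jet_features(channel):
    """All four constrained Jet Features of one input channel.
    The two vertical passes are computed once and reused by two features each."""
    sv = conv1d_along(channel, [1, 2, 1], axis=0)   # vertical smoothing pass
    dv = conv1d_along(channel, [1, 0, -1], axis=0)  # vertical derivative pass
    return {
        "gaussian": conv1d_along(sv, [1, 2, 1], axis=1),
        "sobel_x":  conv1d_along(sv, [1, 0, -1], axis=1),
        "sobel_y":  conv1d_along(dv, [1, 2, 1], axis=1),
        "diag":     conv1d_along(dv, [1, 0, -1], axis=1),
    }

def jet_layer(channels, selections):
    """Form output channels by summing one selected feature per input channel."""
    feats = [constrained_jet_features(c) for c in channels]
    return [sum(f[sel[i]] for i, f in enumerate(feats)) for sel in selections]

x = [np.random.randn(8, 8) for _ in range(2)]
outs = jet_layer(x, [["gaussian", "sobel_y"], ["diag", "gaussian"]])
print(outs[0].shape)   # (6, 6)
```

However many output channels are requested, `constrained_jet_features` is evaluated only once per input channel, which mirrors the fixed feature-computation cost argued for above.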
Figure 5.3: An example of how the operations in Neural Jet Features can be arranged. The number of operations to calculate all features in a standard BNN convolutional layer scales with the number of output channels. For Neural Jet Feature layers, the number of operations to calculate the features is the same regardless of how many output channels there are.
5.3 Results
We tested Neural Jet Features on datasets that are representative of visual inspection tasks
where the images are of a somewhat consistent scale and/or orientation. We used the BYU Fish
dataset [111], BYU Cookie dataset (Section 5.3.3), and the MNIST dataset [10]. The MNIST
dataset has a bit more variety, which is not typical of visual inspection datasets, but it does lend
insight into how Jet Features compare on a widely used dataset.
5.3.1 Model Topology
We experimented with three different types of convolutional layers: standard binarized
convolutional layers [28], unconstrained Neural Jet Feature layers, and constrained Neural Jet
Feature layers (see Section 5.2.1). For all experiments, we used a similar VGG style model topol-
ogy: Conv-BN-Conv-MP-BN-Conv-BN-Conv-MP-BN-FC-BN-FC-BN-SM (Figure 5.4), where
Conv represents convolutional layers, BN represents batch normalization layers, MP represents
max-pooling layers with a window of 2×2 and a stride of 2, FC represents fully connected layers
where the first one contains the number of nodes specified by the experiment and the second one
contains the same number of nodes as there are classes in the dataset being used, and SM represents
a softmax activation. The number of filters in the convolutional layers and the number of neurons
in the first fully connected layer are specified
by the experiment, shown in Table 5.1. We note that these models are smaller than most BNNs
used throughout the literature [112]. All convolutional layers used the same number of filters for
a given experiment. The activation function was the binarization operation that takes place after
the batch normalization layers, except after the final batch normalization layer where a softmax
function was used. The inputs were not binarized, similar to other BNNs throughout the literature.
5.3.2 BYU Fish Dataset
The BYU Fish dataset consists of images of eight different fish species, all oriented the
same way, 161 pixels wide and 46 pixels high. Examples are shown in Figure 2.14 in Chapter 2.
The model used when testing the BYU Fish dataset used 8 filters in the convolutional layers and 16
neurons in the fully connected layer.
Figure 5.4: The model topology used for all experiments. The number of filters in the convolutional layers and the number of nodes in the first fully connected layer change per experiment.
Table 5.1: The layer sizes used for each dataset.
Dataset Conv. Filters Fully Connected Units
BYU Fish 8 16
BYU Cookie 8 8
MNIST 16 32
MNIST 32 128
MNIST 64 256
From the results shown in Figure 5.5, we see that the standard BNN convolutional layers
and the constrained Neural Jet Features performed similarly on the BYU Fish dataset, both reach-
ing an average accuracy of 99.8%. Unconstrained Neural Jet Features performed worse,
hovering around 95% accuracy, significantly worse than the constrained Neural Jet Features. A
similar pattern was shown with the BYU Cookie dataset as well, which we hypothesize is due
to the fact that the unconstrained Neural Jet Features are allowed to learn features that are not as
useful as the ones the constrained version learns.
Figure 5.5: Classification accuracy on the BYU Fish dataset
5.3.3 BYU Cookie Dataset
The BYU Cookie dataset includes images of sandwich-style cookies that are either in good
condition, offset, or broken, as seen in Figure 5.6. These images are 100×100 pixels in size. This
dataset is fairly small with 345 training images and 88 validation images.
The validation accuracy on the BYU Cookie dataset, shown in Figure 5.7, shows that the
normal BNN convolution and constrained Neural Jet Features outperform unconstrained Neural Jet
Figure 5.6: Examples from the BYU Cookie Dataset. The images in this dataset are 100×100 pixels from three classes: good (left), offset (middle), or broken (right).
Features. In addition, we saw that validation accuracy is sporadic over the course of training, which
is to be expected when dealing with a smaller dataset. The results seem to be more consistent when
using constrained Neural Jet Features than standard BNN convolutional layers, which can also be
observed in the results from the MNIST dataset.
Figure 5.7: Classification accuracy on the BYU Cookie dataset
5.3.4 MNIST Dataset
The MNIST dataset consists of 70,000 images of handwritten numbers [10], 28×28 pixels
in size. We trained three models of different sizes on this dataset: one model consisting of 16
filters and 32 fully connected units, one with 32 filters and 128 fully connected units, and another
with 64 filters and 256 fully connected units, which are smaller than other models trained on this
dataset [112]. Figures 5.8, 5.9, and 5.10 show the validation accuracy of these models, respectively.
The scale of the y-axis is kept consistent between these three figures in order to easily compare the
results between each of them. In the larger model, the average accuracy of all models approached
99%.
On the smaller models, the normal BNN convolutions produce inconsistent results, shown
in Figures 5.8 and 5.9, and as seen on the BYU Cookie dataset. This demonstrates a known
difficulty in working with small BNNs. Switching binarized weights between the values -1 and 1
can have dramatic effects in local regions of the network during the learning process. By adding
more weights to a model, this effect is mitigated, as seen in Figure 5.10. Our experiments show
that Neural Jet Features are less susceptible to this effect, making them a good choice for smaller
BNN models. We postulate that this is due to the fact that Neural Jet Feature convolutional layers
are limited to the classic computer vision kernels shown in Figure 5.2.
5.4 Conclusion
We have presented Neural Jet Features, a binarized convolutional layer that combines the
power of deep learning with classic computer vision kernels. Not only do Neural Jet Features
require fewer operations than standard convolutional layers in BNNs, but they are also more stable in
smaller models that would be used for visual inspection tasks. Neural Jet Features have comparable
accuracy on visual inspection datasets while requiring fewer operations and parameters. Neural Jet
Features offer an efficient solution for resource-limited systems that can take advantage of their
bitwise efficiency, like ASIC and FPGA designs.
Figure 5.8: Classification accuracy on the MNIST dataset with 16 filters and 32 dense units.
Figure 5.9: Classification accuracy on the MNIST dataset with 32 filters and 128 dense units.
CHAPTER 6. CONCLUSION
6.1 Discussion
Resource-limited systems, like embedded computers, FPGAs, and ASICs, are compact
and low power. They are ideal platforms for many industrial visual inspection tasks. However,
high-speed image classification algorithms usually require many floating-point operations and large
amounts of memory. Resource-limited systems are not capable of running mainstream image
classification algorithms at high speeds. Large amounts of floating-point values and operations are
not well suited for resource-limited systems. In this work, we have explored the use of binary-
valued algorithms for visual inspection tasks. Binary values are well suited for resource-limited
systems, especially FPGAs and ASICs where individual bits can be easily manipulated.
The current state-of-the-art image classification algorithms are deep learning (DL) models.
They rely on large amounts of floating-point values and operations. They can be trained without
any prior knowledge of the target domain. Traditional image classification algorithms are less ex-
pensive than DL models but require more expert knowledge of the target domain during training.
Our work looks at both traditional and DL image classification techniques. We combine the el-
egance of traditional image classification with the power of DL models to make binarized image
classification algorithms that are suitable for high-speed image classification on resource-limited
systems.
In Chapter 2, we looked at an existing image classification technique that uses traditional
computer vision. The ECO Features algorithm uses a grab bag of standard image processing func-
tions and combines them using a genetic algorithm. Not all image processing functions were
suitable for resource-limited systems. The functions that were selected by the genetic algorithm
most often were convolutional kernels. These particular kernels could all be derived from smaller
binary-valued convolution kernels. We replaced the original grab bag of image processing func-
tions with these binary-valued convolutions, which we call Jet Features. This reduced the compu-
tational and memory costs of the algorithm. It became much simpler to implement the algorithm
on an FPGA. Compared with the original ECO Features algorithm, we saw a 3.7× speedup on
an embedded system and a 78× speedup on an FPGA over a full desktop system. Accuracy was
similar for both algorithms.
Chapter 3 reviewed Binarized Neural Networks (BNNs). These networks use binary values
for both the weights and activations. They are trained through deep learning. The original BNN
introduced by Courbariaux et al. was able to process small datasets like the MNIST dataset. Much
of the work on BNNs has focused on improving accuracy on more challenging datasets like Im-
ageNet. FPGA designers have also paid special attention to BNNs. There have been many works
that implement BNNs on FPGAs. Most of these works require CPU-FPGA co-processing systems or
large stand-alone FPGAs.
Chapters 4 and 5 look at ways to scale down the size of BNNs to allow them to fit on
smaller, more affordable FPGAs. In Chapter 4, we explored some techniques that are used for
full precision systems and applied them to BNNs. However, they did not translate well to BNNs.
In Chapter 5, we introduced Neural Jet Features, which are Jet Features that are trained using the
same deep learning methods used in BNNs. Neural Jet Features were just as accurate as BNNs on
visual inspection tasks but only require a fraction of the number of operations. They are also more
stable when training smaller models.
6.2 Summary of Contributions
• Developed Jet Features:
– Uses common building blocks that can create essential convolution kernels like the
Gaussian and Sobel filters.
– Can be computed efficiently as a whole set, reusing common building blocks.
– Uses binary values, eliminating the need for any multiplication and floating-point values.
– Can be implemented on an FPGA using only line buffers and addition/subtraction units.
• Designed a software implementation of the ECO Jet Features algorithm with a 3.7× speedup
over the original ECO Features Algorithm.
• Implemented the ECO Jet Features Algorithm on an FPGA which achieved a 78× speedup
over the original ECO Features Algorithm on a full-sized desktop.
• Surveyed the literature of BNNs and compared the most prominent methods, weighing their
strengths and weaknesses.
• Compared the various FPGA implementations of BNNs throughout the literature.
• Adopted full-scale deep learning techniques for BNNs in order to reduce the size of BNN
models.
• Developed the Neural Jet Feature Layer to replace the convolutional layer in BNNs and
compared them with standard BNNs.
– Allows the network to reuse filter outputs in order to reduce the required number of
operations and memory space.
– Computes multiple features as a group.
– Performs just as well as, if not better than, standard BNNs on visual inspection tasks while
only using a fraction of the required number of operations.
6.3 Future Work
The Neural Jet Features algorithm needs to be explored further. We plan on implementing
Neural Jet Feature layers on an FPGA. This will give us more insight into how the method can be
further developed to reduce the computational and memory costs of the algorithm. We have not
yet explored reusing Jet Features between layers. This could be explored through the use of skip
connections, but only of individual layer features. This may reduce the memory costs that were
discussed in Section 4.2.2.
We have not yet explored the use of Neural Jet Features mixed with standard BNN convolu-
tional layers. Neural Jet Features use the same classic kernels that are used in the image processing
portions of traditional image classification systems (See Figure 1.1). It may be advantageous to use
Neural Jet Features as a front-end image processing stage, followed by standard BNN convolutional
layers, followed by the fully connected layer back end. This may allow the algorithm to perform
90
well on more complex datasets while still reducing the number of operations and memory space
requirements.
REFERENCES
[1] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L., 2015. “ImageNet Large Scale Visual Recognition Challenge.” International Journal of Computer Vision (IJCV), 115(3), pp. 211–252. 3, 4, 6

[2] Lowe, D. G., 2004. “Distinctive image features from scale-invariant keypoints.” International Journal of Computer Vision, 60(2), Nov, pp. 91–110. 3, 77

[3] Bay, H., Tuytelaars, T., and Van Gool, L., 2006. “Surf: Speeded up robust features.” In Computer Vision – ECCV 2006, A. Leonardis, H. Bischof, and A. Pinz, eds., Springer Berlin Heidelberg, pp. 404–417. 3, 77

[4] Rublee, E., Rabaud, V., Konolige, K., and Bradski, G., 2011. “Orb: An efficient alternative to sift or surf.” In 2011 International Conference on Computer Vision, pp. 2564–2571. 3

[5] Cortes, C., and Vapnik, V., 1995. “Support-vector networks.” In Machine Learning, pp. 273–297. 3

[6] Kotsiantis, S. B., 2013. “Decision trees: a recent overview.” Artificial Intelligence Review, 39(4), Apr, pp. 261–283. 3

[7] Csurka, G., Dance, C. R., Fan, L., Willamowski, J., and Bray, C., 2004. “Visual categorization with bags of keypoints.” In Workshop on Statistical Learning in Computer Vision, ECCV, pp. 1–22. 3

[8] Guo, G., Wang, H., Bell, D., Bi, Y., and Greer, K., 2003. “Knn model-based approach in classification.” In On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE, R. Meersman, Z. Tari, and D. C. Schmidt, eds., Springer Berlin Heidelberg, pp. 986–996. 3

[9] MacQueen, J., 1967. “Some methods for classification and analysis of multivariate observations.” In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, University of California Press, pp. 281–297. 3

[10] Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P., 1998. “Gradient-based learning applied to document recognition.” Proceedings of the IEEE, 86(11), Nov, pp. 2278–2324. 4, 81, 85

[11] Krizhevsky, A., Sutskever, I., and Hinton, G. E., 2012. “Imagenet classification with deep convolutional neural networks.” In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, Curran Associates Inc., pp. 1097–1105. 4, 12

[12] Simonyan, K., and Zisserman, A., 2014. “Very deep convolutional networks for large-scale image recognition.” arXiv 1409.1556, 09. 4

[13] He, K., Gkioxari, G., Dollar, P., and Girshick, R. B., 2017. “Mask R-CNN.” CoRR, abs/1703.06870. 4

[14] Huang, G., Liu, Z., Weinberger, K. Q., and van der Maaten, L., 2016. “Densely connected convolutional networks.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269. 4, 69

[15] Xie, S., Girshick, R., Dollar, P., Tu, Z., and He, K., 2017. “Aggregated residual transformations for deep neural networks.” pp. 5987–5995. 4

[16] Lillywhite, K., Lee, D.-J., Tippetts, B., and Archibald, J., 2013. “A feature construction method for general object recognition.” Pattern Recognition, 46(12), pp. 3300–3314. 4, 9, 12, 13, 16

[17] Florack, L., Ter Haar Romeny, B., Viergever, M., and Koenderink, J., 1996. “The gaussian scale-space paradigm and the multiscale local jet.” Int. J. Comput. Vision, 18(1), Apr, pp. 61–75. 5, 10, 12

[18] Kim, M., and Smaragdis, P., 2018. “Bitwise Neural Networks for Efficient Single-Channel Source Separation.” In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 2018-April, IEEE, pp. 701–705. 5, 56

[19] Lillywhite, K., Tippetts, B., and Lee, D.-J., 2012. “Self-tuned evolution-constructed features for general object recognition.” Pattern Recognition, 45(1), pp. 241–251. 8, 9, 12, 16, 75

[20] Lillholm, M., and Pedersen, K. S., 2004. “Jet based feature classification.” Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., 2, pp. 787–790 Vol. 2. 12

[21] Larsen, A. B. L., Darkner, S., Dahl, A. L., and Pedersen, K. S., 2012. “Jet-based local image descriptors.” In Computer Vision – ECCV 2012, A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid, eds., Springer Berlin Heidelberg, pp. 638–650. 12

[22] Manzanera, A., 2011. “Local jet feature space framework for image processing and representation.” In 2011 Seventh International Conference on Signal Image Technology Internet-Based Systems, pp. 261–268. 12

[23] Breiman, L., 2001. “Random forests.” Machine Learning, 45(1), Oct, pp. 5–32. 13

[24] Zhu, J., Rosset, S., Zou, H., and Hastie, T., 2006. “Multi-class adaboost.” Statistics and its Interface, 2, 02. 15

[25] Freund, Y., and Schapire, R. E., 1997. “A decision-theoretic generalization of on-line learning and an application to boosting.” J. Comput. Syst. Sci., 55(1), Aug, pp. 119–139. 15
[26] Zhang, M., Lee, D.-J., Lillywhite, K., and Tippetts, B., 2017. “Automatic quality and mois-ture evaluations using evolution constructed features.” Computers and Electronics in Agri-culture, 135, pp. 321 – 327. 16
93
[27] Prost-Boucle, A., Bourge, A., Petrot, F., Alemdar, H., Caldwell, N., and Leroy, V., 2017.“Scalable high-performance architecture for convolutional ternary neural networks on fpga.”In 2017 27th International Conference on Field Programmable Logic and Applications(FPL), pp. 1–7. 29
[28] Courbariaux, M., and Bengio, Y., 2016. “Binarynet: Training deep neural networks withweights and activations constrained to +1 or -1.” CoRR, abs/1602.02830. 33, 37, 38, 39,41, 42, 43, 44, 49, 53, 56, 57, 59, 60, 61, 81
[29] Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto,M., and Adam, H., 2017. “Mobilenets: Efficient convolutional neural networks for mobilevision applications.” CoRR, abs/1704.04861. 35, 67
[30] Jaderberg, M., Vedaldi, A., and Zisserman, A., 2014. “Speeding up convolutional neuralnetworks with low rank expansions.” CoRR, abs/1405.3866. 35
[31] Chen, Y., Wang, N., and Zhang, Z., 2017. “Darkrank: Accelerating deep metric learning viacross sample similarities transfer.” CoRR, abs/1707.01220. 35
[32] Iandola, F. N., Moskewicz, M. W., Ashraf, K., Han, S., Dally, W. J., and Keutzer, K., 2016.“Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size.”CoRR, abs/1602.07360. 35
[33] Hanson, S. J., and Pratt, L., 1989. “Advances in neural information processing systems1.” Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ch. Comparing Biases forMinimal Network Construction with Back-propagation, pp. 177–185. 35
[34] Cun, Y. L., Denker, J. S., and Solla, S. A., 1990. “Advances in neural information processingsystems 2.” Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ch. Optimal BrainDamage, pp. 598–605. 35
[35] Han, S., Mao, H., and Dally, W. J., 2016. “Deep compression: Compressing deep neuralnetwork with pruning, trained quantization and huffman coding.” CoRR, abs/1510.00149.35
[36] Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan, P., 2015. “Deep Learning withLimited Numerical Precision.”. 36
[37] Courbariaux, M., Bengio, Y., and David, J.-P., 2014. “Training deep neural networks withlow precision multiplications.” pp. 1–10. 36
[38] Zhou, S., Ni, Z., Zhou, X., Wen, H., Wu, Y., and Zou, Y., 2016. “Dorefa-net: Train-ing low bitwidth convolutional neural networks with low bitwidth gradients.” CoRR,abs/1606.06160. 36, 44, 45, 51, 54, 60, 62
[39] Seo, J., Yu, J., Lee, J., and Choi, K., 2016. “A new approach to binarizing neural networks.”In 2016 International SoC Design Conference (ISOCC), IEEE, pp. 77–78. 37, 60
94
[40] Yonekawa, H., Sato, S., and Nakahara, H., 2018. “A Ternary Weight Binary Input Convo-lutional Neural Network: Realization on the Embedded Processor.” In 2018 IEEE 48th In-ternational Symposium on Multiple-Valued Logic (ISMVL), Vol. 2018-May, IEEE, pp. 174–179. 37, 61
[41] Hwang, K., and Sung, W., 2014. “Fixed-point feedforward deep neural network designusing weights +1, 0, and -1.” IEEE Workshop on Signal Processing Systems, SiPS: Designand Implementation, pp. 1–6. 37, 60
[42] Prost-Boucle, A., Bourge, A., and Petrot, F., 2018. “High-Efficiency Convolutional TernaryNeural Networks with Custom Adder Trees and Weight Compression.” ACM Transactionson Reconfigurable Technology and Systems, 11(3), dec, pp. 1–24. 37, 60, 61
[43] Saad, D., and Marom, E., 1990. “Training feed forward nets with binary weights via amodified chir algorithm.” Complex Systems, 4, 01. 37
[44] Baldassi, C., Braunstein, A., Brunel, N., and Zecchina, R., 2007. “Efficient supervisedlearning in networks with binary synapses.” Proceedings of the National Academy of Sci-ences, 104(26), pp. 11079–11084. 37
[45] Soudry, D., Hubara, I., and Meir, R., 2014. “Expectation Backpropagation: parameter-freetraining of multilayer neural networks with real and discrete weights.” Advances in NeuralInformation Processing Systems, 2(1), pp. 963—-971. 37, 56
[46] Courbariaux, M., Bengio, Y., and David, J.-P., 2015. “BinaryConnect: Training Deep NeuralNetworks with binary weights during propagations.” In NIPS, pp. 3123–3131. 37, 38, 43,60, 61
[47] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, R. F., 2013. “DropConnect.” Interna-tional Conference on Machine Learning. 37
[48] Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y., 2016. “BinarizedNeural Networks.” In NIPS, pp. 1–9. 37, 38, 43, 44, 62
[49] Ding, R., Liu, Z., Shi, R., Marculescu, D., and Blanton, R. S., 2017. “LightNN.” InProceedings of the on Great Lakes Symposium on VLSI 2017 - GLSVLSI ’17, ACM Press,pp. 35–40. 37
[50] Ding, R., Liu, Z., Blanton, R. D. S., and Marculescu, D., 2018. “Lightening the Loadwith Highly Accurate Storage- and Energy-Efficient LightNNs.” ACM Transactions onReconfigurable Technology and Systems, 11(3), dec, pp. 1–24. 37
[51] Bengio, Y., Leonard, N., and Courville, A., 2013. “Estimating or Propagating GradientsThrough Stochastic Neurons for Conditional Computation.” pp. 1–12. 38
[52] Lin, X., Zhao, C., and Pan, W., 2017. “Towards Accurate Binary Convolutional NeuralNetwork.” In NIPS. 42, 46, 51, 52, 62
[53] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus,R., 2014. “Intriguing properties of neural networks.” CoRR, abs/1312.6199. 42
95
[54] Moosavi-Dezfooli, S., Fawzi, A., Fawzi, O., and Frossard, P., 2016. “Universal adversarialperturbations.” CoRR, abs/1610.08401. 42
[55] Galloway, A., Taylor, G. W., and Moussa, M., 2017. “Attacking binarized neural networks.”CoRR, abs/1711.00449. 42
[56] Khalil, E. B., Gupta, A., and Dilkina, B., 2018. “Combinatorial attacks on binarized neuralnetworks.” CoRR, abs/1810.03538. 42
[57] Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y., 2017. “Quantizedneural networks: Training neural networks with low precision weights and activations.” J.Mach. Learn. Res., 18(1), jan, pp. 6869–6898. 44, 62
[58] Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A., 2016. “XNOR-Net: ImageNetClassification Using Binary Convolutional Neural Networks.” In ECCV. pp. 525–542. 44,51, 62, 67
[59] Kanemura, A., Sawada, H., Wakisaka, T., and Hano, H., 2017. “Experimental explorationof the performance of binary networks.” In 2017 IEEE 2nd International Conference onSignal and Image Processing (ICSIP), Vol. 2017-Janua, IEEE, pp. 451–455. 44, 61
[60] Tang, W., Hua, G., and Wang, L., 2017. “How to Train a Compact Binary Neural Networkwith High Accuracy?.” AAAI. 45, 46, 47, 51, 52, 53, 55, 62
[61] Darabi, S., Belbahri, M., Courbariaux, M., and Nia, V. P., 2018. “BNN+: Improved BinaryNetwork Training.” Seventh International Conference on Learning Representations, dec,pp. 1–10. 51, 53, 61, 62
[62] Ghasemzadeh, M., Samragh, M., and Koushanfar, F., 2018. “ReBNet: Residual Bina-rized Neural Network.” In 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), IEEE, pp. 57–64. 52, 59, 60, 61, 62,63, 64, 65, 75
[63] Prabhu, A., Batchu, V., Gajawada, R., Munagala, S. A., and Namboodiri, A., 2018. “HybridBinary Networks: Optimizing for Accuracy, Efficiency and Memory.” In 2018 IEEE WinterConference on Applications of Computer Vision (WACV), Vol. 2018-Janua, IEEE, pp. 821–829. 53, 62
[64] Wang, H., Xu, Y., Ni, B., Zhuang, L., and Xu, H., 2018. “Flexible Network Binarizationwith Layer-Wise Priority.” In 2018 25th IEEE International Conference on Image Process-ing (ICIP), IEEE, pp. 2346–2350. 53
[65] Zhao, R., Song, W., Zhang, W., Xing, T., Lin, J.-H., Srivastava, M., Gupta, R., andZhang, Z., 2017. “Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs.” In Proceedings of the 2017 ACM/SIGDA International Symposiumon Field-Programmable Gate Arrays - FPGA ’17, ACM Press, pp. 15–24. 54, 61, 63, 64,65, 75
96
[66] Guo, P., Ma, H., Chen, R., Li, P., Xie, S., and Wang, D., 2018. “FBNA: A Fully Bina-rized Neural Network Accelerator.” In 2018 28th International Conference on Field Pro-grammable Logic and Applications (FPL), IEEE, pp. 51–513. 54, 60, 61, 63, 64, 65, 75
[67] Fraser, N. J., Umuroglu, Y., Gambardella, G., Blott, M., Leong, P., Jahre, M., and Vissers,K., 2017. “Scaling Binarized Neural Networks on Reconfigurable Logic.” In Proceedings ofthe 8th Workshop and 6th Workshop on Parallel Programming and Run-Time ManagementTechniques for Many-core Architectures and Design Tools and Architectures for MulticoreEmbedded Computing Platforms - PARMA-DITAM ’17, ACM Press, pp. 25–30. 54, 61, 64,65, 67
[68] Umuroglu, Y., Fraser, N. J., Gambardella, G., Blott, M., Leong, P., Jahre, M., and Vissers,K., 2017. “FINN.” In Proceedings of the 2017 ACM/SIGDA International Symposium onField-Programmable Gate Arrays - FPGA ’17, ACM Press, pp. 65–74. 56, 59, 60, 61, 63,64, 65
[69] Song, D., Yin, S., Ouyang, P., Liu, L., and Wei, S., 2018. “Low Bits: Binary Neural Networkfor Vad and Wakeup.” In 2018 5th International Conference on Information Science andControl Engineering (ICISCE), IEEE, pp. 306–311. 56
[70] Yin, S., Ouyang, P., Zheng, S., Song, D., Li, X., Liu, L., and Wei, S., 2018. “A 141 UW, 2.46PJ/Neuron Binarized Convolutional Neural Network Based Self-Learning Speech Recogni-tion Processor in 28NM CMOS.” In 2018 IEEE Symposium on VLSI Circuits, Vol. 2018-June, IEEE, pp. 139–140. 56, 64
[71] Li, Y., Liu, Z., Liu, W., Jiang, Y., Wang, Y., Goh, W. L., Yu, H., and Ren, F., 2018. “A34-FPS 698-GOP/s/W Binarized Deep Neural Network-based Natural Scene Text Interpre-tation Accelerator for Mobile Edge Computing.” IEEE Transactions on Industrial Electron-ics, PP(c), pp. 1–1. 56, 64
[72] Bulat, A., and Tzimiropoulos, G., 2017. “Binarized Convolutional Landmark Localizersfor Human Pose Estimation and Face Alignment with Limited Resources.” In 2017 IEEEInternational Conference on Computer Vision (ICCV), Vol. 2017-Octob, IEEE, pp. 3726–3734. 56
[73] Ma, C., Guo, Y., Lei, Y., and An, W., 2019. “Binary Volumetric Convolutional NeuralNetworks for 3-D Object Recognition.” IEEE Transactions on Instrumentation and Mea-surement, 68(1), jan, pp. 38–48. 56
[74] Eccv, A., 2018. Efficient Super Resolution Using Binarized Neural Network. 56
[75] Bulat, A., and Tzimiropoulos, Y., 2018. “Hierarchical binary CNNs for landmark localiza-tion with limited resources.” IEEE Transactions on Pattern Analysis and Machine Intelli-gence, PP(8), pp. 1–1. 56
[76] Say, B., and Sanner, S., 2018. “Planning in factored state and action spaces with learnedbinarized neural network transition models.” IJCAI International Joint Conference on Arti-ficial Intelligence, 2018-July, pp. 4815–4821. 56
97
[77] Chi, C.-C., and Jiang, J.-H. R., 2018. “Logic synthesis of binarized neural networks for effi-cient circuit implementation.” In Proceedings of the International Conference on Computer-Aided Design - ICCAD ’18, ACM Press, pp. 1–7. 59, 61
[78] Narodytska, N., Ryzhyk, L., and Walsh, T., 2018. “Verifying Properties of Binarized DeepNeural Networks.” pp. 6615–6624. 59
[79] Yang, H., Fritzsche, M., Bartz, C., and Meinel, C., 2017. “BMXNet: An Open-SourceBinary Neural Network Implementation Based on MXNet.” Workshop: Proceedings ofNew Security Paradigms, may. 59, 61
[80] Blott, M., Preußer, T. B., Fraser, N. J., Gambardella, G., O’brien, K., Umuroglu, Y., Leeser,M., and Vissers, K., 2018. “FINN-R: An End-to-End Deep-Learning Framework for FastExploration of Quantized Neural Networks.” ACM Transactions on Reconfigurable Tech-nology and Systems, 11(3), dec, pp. 1–23. 59, 60, 61, 63, 64, 65, 75
[81] McDanel, B., Teerapittayanon, S., and Kung, H. T., 2017. “Embedded Binarized NeuralNetworks.” pp. 168–173. 59, 61
[82] Jokic, P., Emery, S., and Benini, L., 2018. “BinaryEye: A 20 kfps Streaming Camera Systemon FPGA with Real-Time On-Device Image Recognition Using Binary Neural Networks.”In 2018 IEEE 13th International Symposium on Industrial Embedded Systems (SIES), IEEE,pp. 1–7. 59, 64, 65
[83] Valavi, H., Ramadge, P. J., Nestler, E., and Verma, N., 2018. “A Mixed-Signal BinarizedConvolutional-Neural-Network Accelerator Integrating Dense Weight Storage and Multipli-cation for Reduced Data Movement.” In 2018 IEEE Symposium on VLSI Circuits, Vol. 2018-June, IEEE, pp. 141–142. 59, 61, 64
[84] Kim, M., Smaragdis, P., and Edu, P. I., 2016. “Bitwise Neural Networks.”. 59, 60
[85] Sun, X., Yin, S., Peng, X., Liu, R., Seo, J.-s., and Yu, S., 2018. “XNOR-RRAM: A scalableand parallel resistive synaptic architecture for binary neural networks.” In 2018 Design,Automation & Test in Europe Conference & Exhibition (DATE), Vol. 2018-Janua, IEEE,pp. 1423–1428. 59, 61, 66
[86] Yu, S., Li, Z., Chen, P.-Y., Wu, H., Gao, B., Wang, D., Wu, W., and Qian, H., 2016. “Binaryneural network with 16 Mb RRAM macro chip for classification and online training.” In2016 IEEE International Electron Devices Meeting (IEDM), IEEE, pp. 16.2.1–16.2.4. 60,66
[87] Zhou, Y., Redkar, S., and Huang, X., 2017. “Deep learning binary neural network on anFPGA.” In 2017 IEEE 60th International Midwest Symposium on Circuits and Systems(MWSCAS), Vol. 2017-Augus, IEEE, pp. 281–284. 61, 63, 65
[88] Nakahara, H., Fujii, T., and Sato, S., 2017. “A fully connected layer elimination for a bi-narizec convolutional neural network on an FPGA.” In 2017 27th International Conferenceon Field Programmable Logic and Applications (FPL), IEEE, pp. 1–4. 61, 63, 65
98
[89] Yang, L., He, Z., and Fan, D., 2018. “A Fully Onchip Binarized Convolutional NeuralNetwork FPGA Impelmentation with Accurate Inference.” Proceedings of the InternationalSymposium on Low Power Electronics and Design, pp. 50:1—-50:6. 61, 63, 64, 65
[90] Bankman, D., Yang, L., Moons, B., Verhelst, M., and Murmann, B., 2019. “An Always-On 3.8 micro J/86% CIFAR-10 Mixed-Signal Binary CNN Processor With All Memory onChip in 28-nm CMOS.” IEEE Journal of Solid-State Circuits, 54(1), jan, pp. 158–172. 61,64
[91] Rusci, M., Rossi, D., Flamand, E., Gottardi, M., Farella, E., and Benini, L., 2018. “Always-ON visual node with a hardware-software event-based binarized neural network inferenceengine.” In Proceedings of the 15th ACM International Conference on Computing Frontiers- CF ’18, no. 1, ACM Press, pp. 314–319. 61
[92] Ding, R., Liu, Z., Blanton, R. D. S., and Marculescu, D., 2018. “Quantized Deep NeuralNetworks for Energy Efficient Hardware-based Inference.” pp. 1–8. 62
[93] Ling, Y., Zhong, K., Wu, Y., Liu, D., Ren, J., Liu, R., Duan, M., Liu, W., and Liang, L.,2018. “TaiJiNet: Towards Partial Binarized Convolutional Neural Network for EmbeddedSystems.” In 2018 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Vol. 2018-July, IEEE, pp. 136–141. 62
[94] Yonekawa, H., and Nakahara, H., 2017. “On-Chip Memory Based Binarized ConvolutionalDeep Neural Network Applying Batch Normalization Free Technique on an FPGA.” In 2017IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW),IEEE, pp. 98–105. 62, 63, 64, 65, 66, 75
[95] Rybalkin, V., Pappalardo, A., Ghaffar, M. M., Gambardella, G., Wehn, N., and Blott, M.,2018. “FINN-L: Library Extensions and Design Trade-Off Analysis for Variable Preci-sion LSTM Networks on FPGAs.” In 2018 28th International Conference on Field Pro-grammable Logic and Applications (FPL), IEEE, pp. 89–897. 63, 64
[96] Nakahara, H., Yonekawa, H., Sasao, T., Iwamoto, H., and Motomura, M., 2016. “Amemory-based realization of a binarized deep convolutional neural network.” In 2016 In-ternational Conference on Field-Programmable Technology (FPT), IEEE, pp. 277–280. 63,65
[97] Nakahara, H., Yonekawa, H., Fujii, T., and Sato, S., 2018. “A Lightweight YOLOv2.”In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-ProgrammableGate Arrays - FPGA ’18, ACM Press, pp. 31–40. 63
[98] Faraone, J., Fraser, N., Blott, M., and Leong, P. H. W., 2018. “SYQ: Learning SymmetricQuantization For Efficient Deep Neural Networks.”. 64
[99] Nurvitadhi, E., Sheffield, D., Jaewoong Sim, Mishra, A., Venkatesh, G., and Marr, D., 2016.“Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC.”In 2016 International Conference on Field-Programmable Technology (FPT), IEEE, pp. 77–84. 64, 66
99
[100] Jafari, A., Hosseini, M., Kulkarni, A., Patel, C., and Mohsenin, T., 2018. “BiNMAC.” InProceedings of the 2018 on Great Lakes Symposium on VLSI - GLSVLSI ’18, ACM Press,pp. 443–446. 64
[101] Bahou, A. A., Karunaratne, G., Andri, R., Cavigelli, L., and Benini, L., 2018. “XNORBIN:A 95 TOp/s/W hardware accelerator for binary convolutional neural networks.” In 2018IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS), IEEE, pp. 1–3. 64
[102] Rusci, M., Cavigelli, L., and Benini, L., 2018. “Design Automation for Binarized NeuralNetworks: A Quantum Leap Opportunity?.” In 2018 IEEE International Symposium onCircuits and Systems (ISCAS), Vol. 732631, IEEE, pp. 1–5. 66
[103] Sun, X., Liu, R., Peng, X., and Yu, S., 2018. “Computing-in-Memory with SRAM andRRAM for Binary Neural Networks.” In 2018 14th IEEE International Conference onSolid-State and Integrated Circuit Technology (ICSICT), IEEE, pp. 1–4. 66
[104] Choi, W., Jeong, K., Choi, K., Lee, K., and Park, J., 2018. “Content addressable mem-ory based binarized neural network accelerator using time-domain signal processing.” InProceedings of the 55th Annual Design Automation Conference on - DAC ’18, ACM Press,pp. 1–6. 66
[105] Angizi, S., and Fan, D., 2017. “IMC: Energy -Efficient In-Memory Concvolver for Accel-erating Binarized Deep Neural Networks.” In Proceedings of the Neuromorphic ComputingSymposium on - NCS ’17, no. 1, ACM Press, pp. 1–8. 66
[106] Liu, R., Peng, X., Sun, X., Khwa, W.-S., Si, X., Chen, J.-J., Li, J.-F., Chang, M.-F., andYu, S., 2018. “Parallelizing SRAM Arrays with Customized Bit-Cell for Binary NeuralNetworks.” In 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), IEEE,pp. 1–6. 66
[107] Zhou, Z., Huang, P., Xiang, Y. C., Shen, W. S., Zhao, Y. D., Feng, Y. L., Gao, B., Wu,H. Q., Qian, H., Liu, L. F., Zhang, X., Liu, X. Y., and Kang, J. F., 2018. “A new hardwareimplementation approach of BNNs based on nonlinear 2T2R synaptic cell.” In 2018 IEEEInternational Electron Devices Meeting (IEDM), IEEE, pp. 20.7.1–20.7.4. 66
[108] Tang, T., Xia, L., Li, B., Wang, Y., and Yang, H., 2017. “Binary convolutional neuralnetwork on rram.” In 2017 22nd Asia and South Pacific Design Automation Conference(ASP-DAC), pp. 782–787. 66
[109] Yang, L., He, Z., and Fan, D., 2019. “Binarized depthwise separable neural network forobject tracking in fpga.” In Proceedings of the 2019 on Great Lakes Symposium on VLSI,GLSVLSI ’19, Association for Computing Machinery, p. 347–350. 69
[110] He, K., Zhang, X., Ren, S., and Sun, J., 2016. “Deep residual learning for image recog-nition.” In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),pp. 770–778. 71
[111] Simons, T., and Lee, D.-J., 2019. “Jet features: Hardware-friendly, learned convolutionalkernels for high-speed image classification.” Electronics, 8(5). 75, 81
100