
Brigham Young University

BYU ScholarsArchive

Theses and Dissertations

2021-06-16

High-Speed Image Classification for Resource-Limited Systems

Using Binary Values

Taylor Scott Simons, Brigham Young University

Follow this and additional works at: https://scholarsarchive.byu.edu/etd

Part of the Engineering Commons

BYU ScholarsArchive Citation
Simons, Taylor Scott, "High-Speed Image Classification for Resource-Limited Systems Using Binary Values" (2021). Theses and Dissertations. 9097. https://scholarsarchive.byu.edu/etd/9097

This Dissertation is brought to you for free and open access by BYU ScholarsArchive. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of BYU ScholarsArchive. For more information, please contact [email protected].

High-Speed Image Classification

For Resource-Limited Systems

Using Binary Values

Taylor Scott Simons

A dissertation submitted to the faculty of Brigham Young University

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

Dah-Jye Lee, Chair
Randal Beard
Jeffery Goeders
David Long

Department of Electrical and Computer Engineering

Brigham Young University

Copyright © 2021 Taylor Scott Simons

All Rights Reserved

ABSTRACT

High-Speed Image Classification for Resource-Limited Systems

Using Binary Values

Taylor Scott Simons
Department of Electrical and Computer Engineering, BYU

Doctor of Philosophy

Image classification is a memory- and compute-intensive task. It is difficult to implement high-speed image classification algorithms on resource-limited systems like FPGAs and embedded computers. Most image classification algorithms require many fixed- and/or floating-point operations and values. In this work, we explore the use of binary values to reduce the memory and compute requirements of image classification algorithms. Our objective was to implement these algorithms on resource-limited systems while maintaining comparable accuracy and high speeds. By implementing high-speed image classification algorithms on resource-limited systems like embedded computers, FPGAs, and ASICs, automated visual inspection can be performed on small, low-powered systems. Industries like manufacturing, medicine, and agriculture can benefit from compact, high-speed, low-power visual inspection systems. Tasks like defect detection in manufactured products and quality sorting of harvested produce can be performed more cheaply and quickly. In this work, we present ECO Jet Features, an algorithm adapted to use binary values for visual inspection. The ECO Jet Features algorithm ran 3.7× faster than the original ECO Features algorithm on embedded computers. It also allowed the algorithm to be implemented on an FPGA, achieving a 78× speedup over full-sized desktop systems while using a fraction of the power and space. We reviewed Binarized Neural Networks (BNNs), neural networks that use binary values for weights and activations. These networks are particularly well suited for FPGA implementation, and we compared and contrasted various FPGA implementations found throughout the literature. Finally, we combined the deep learning methods used in BNNs with the efficiency of Jet Features to make Neural Jet Features. Neural Jet Features are binarized convolutional layers that are learned through deep learning and learn classic computer vision kernels like the Gaussian and Sobel kernels. These kernels are efficiently computed as a group and their outputs can be reused when forming output channels. They performed just as well as BNN convolutions on visual inspection tasks and are more stable when trained on small models.

Keywords: image classification, computer vision, FPGA, embedded systems, neural networks

ACKNOWLEDGMENTS

I would like to thank and acknowledge my advisor, Dr. Lee. He has always believed in and

supported me throughout this process. Through his example, he has helped me become a better

student, researcher, and person. I would like to thank my wife and children for filling my life with

joy. Taylor, Joan, and Harvey have been a foundation of comfort and reassurance that I have relied

on many times throughout graduate school and life. I also thank my parents for the many years

they have devoted to raising me, teaching me, and caring for me. I would like to thank all my

friends and classmates that helped me along the way. I would like to thank (and apologize to) all

my professors here at BYU that have inspired me and put up with me, especially when I thought I

was smarter than I actually was.

TABLE OF CONTENTS

TITLE PAGE

ABSTRACT

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

Chapter 1  Introduction
  1.1 Image Classification
    1.1.1 Traditional Image Classification
    1.1.2 Convolutional Neural Networks
  1.2 Overview
    1.2.1 ECO Jet Features
    1.2.2 Binarized Neural Networks
    1.2.3 Scaling Down Binarized Neural Networks
    1.2.4 Neural Jet Features
  1.3 Contributions

Chapter 2  Jet Features: Hardware-Friendly, Learned Convolutional Kernels for High-Speed Image Classification
  2.1 Introduction
  2.2 Jet Features
    2.2.1 Introduction to Jet Features
    2.2.2 Multiscale Local Jets
  2.3 The ECO Features Algorithm
  2.4 The ECO Jet Features Algorithm
    2.4.1 Jet Feature Selection
    2.4.2 Advantages in Software
    2.4.3 Advantages in Hardware
  2.5 Hardware Architecture
    2.5.1 The Jet Features Unit
    2.5.2 The Random Forest Unit
  2.6 Results
    2.6.1 Datasets
    2.6.2 Accuracy on MNIST and CIFAR-10
    2.6.3 Accuracy on the BYU Fish Dataset
    2.6.4 Software Speed Comparison
    2.6.5 Hardware Implementation Results
  2.7 Smart Camera System
    2.7.1 System Overview
    2.7.2 Software
  2.8 Conclusion

Chapter 3  A Review of Binarized Neural Networks
  3.1 Introduction
  3.2 Terminology
  3.3 Background
    3.3.1 Network Quantization Techniques
    3.3.2 Early Binarization
  3.4 An Introduction to BNNs
    3.4.1 Binarization of Weights
    3.4.2 Binarization of Activations
    3.4.3 Bitwise Operations
    3.4.4 Batch Normalization
    3.4.5 Accuracy
    3.4.6 Robustness to Attacks
  3.5 Major BNN Developments
    3.5.1 The Original BNN
    3.5.2 XNOR-Net
    3.5.3 DoReFa-Net
    3.5.4 Tang et al.
    3.5.5 ABC-Net
    3.5.6 BNN+
    3.5.7 Comparison
  3.6 Improving BNNs
    3.6.1 Scaling with a Gain Term
    3.6.2 Using Multiple Bases
    3.6.3 Partial Binarization
    3.6.4 Learning Rate
    3.6.5 Padding
    3.6.6 More Binarization
    3.6.7 Batch Normalization and Activations as a Threshold
    3.6.8 Layer Order
  3.7 Comparison of Accuracy
    3.7.1 Datasets
    3.7.2 Topologies
    3.7.3 Table of Comparisons
  3.8 Hardware Implementations
    3.8.1 FPGA Implementations
    3.8.2 Architectures
    3.8.3 High-Level Synthesis
    3.8.4 Comparison of FPGA Implementations
    3.8.5 ASICs
  3.9 Conclusions

Chapter 4  Using Full Precision Methods to Scale Down Binarized Neural Networks
  4.1 Depthwise Separable Convolutions
    4.1.1 Experiments and Results
    4.1.2 Discussion
  4.2 Direct Skip Connections in BNNs
    4.2.1 Experiments and Results
    4.2.2 Discussion

Chapter 5  Neural Jet Features
  5.1 Introduction
  5.2 Neural Jet Features
    5.2.1 Constrained Neural Jet Features
    5.2.2 Jet Features
    5.2.3 Computational Efficiency
  5.3 Results
    5.3.1 Model Topology
    5.3.2 BYU Fish Dataset
    5.3.3 BYU Cookie Dataset
    5.3.4 MNIST Dataset
  5.4 Conclusion

Chapter 6  Conclusion
  6.1 Discussion
  6.2 Summary of Contributions
  6.3 Future Work

REFERENCES


LIST OF TABLES

2.1  ECO Feature Transforms
2.2  ECO Jet Feature FPGA Resource Usage
2.3  ECO Feature FPGA Resource Usage Per Unit

3.1  XNOR Operation Equivalence
3.2  Accuracy of BNNs on the CIFAR-10 Dataset
3.3  Accuracy of BNNs on the ImageNet Dataset
3.4  Overview of BNN Models and Methods
3.5  Accuracy of BNNs on the MNIST Dataset
3.6  Accuracy of Non-binary Models on the MNIST Dataset
3.7  Accuracy of BNNs on the SVHN Dataset
3.8  Accuracy of Non-binary Models on the SVHN Dataset
3.9  Accuracy of BNNs on the CIFAR-10 Dataset
3.10 Accuracy of Non-binary Models on the CIFAR-10 Dataset
3.11 Accuracy of BNNs on the ImageNet Dataset
3.12 Comparison of BNNs Implemented on FPGAs

5.1  Layer Sizes for Neural Jet Feature Models


LIST OF FIGURES

1.1  Computational and Memory Cost of Image Classification Systems

2.1  Example of a Separable Filter
2.2  Jet Feature Kernels
2.3  Jet Feature Examples
2.4  Example of ECO Feature Transforms
2.5  ECO Feature Mutation and Crossover
2.6  ECO Feature Example
2.7  The Original ECO Features Algorithm
2.8  ECO Jet Mutation and Crossover
2.9  ECO Jet Architecture
2.10 Single-Buffer Convolution Units
2.11 ECO Jet Accuracy vs. Jet Order
2.12 ECO Jet Features Accuracy vs. Node Count
2.13 Examples from the MNIST and CIFAR-10 Datasets
2.14 Examples from the BYU Fish Dataset
2.15 ECO Features Accuracy on the CIFAR-10 Dataset
2.16 ECO Features Accuracy on the MNIST Dataset
2.17 ECO Jet Accuracy on the BYU Fish Dataset
2.18 Smart Camera System

3.1  The Sign Layer/Straight-Through Estimator
3.2  Topology of the Original Binarized Neural Networks

4.1  Standard Convolution Filters
4.2  Depthwise Separable Convolution Filters
4.3  Depthwise Separable Convolution Topology
4.4  Accuracy of Depthwise Separable BNNs on the CIFAR-10 Dataset
4.5  Direct Skip Connections
4.6  Accuracy of Direct Skip Connections on the CIFAR-10 Dataset

5.1  Jet Feature Building Blocks
5.2  The "Constrained" Neural Jet Feature Kernels
5.3  Neural Jet Feature Operations
5.4  Neural Jet Feature Topology
5.5  Neural Jet Feature Accuracy on the BYU Fish Dataset
5.6  The BYU Cookie Dataset
5.7  Neural Jet Feature Accuracy on the BYU Cookie Dataset
5.8  Neural Jet Feature Accuracy on the MNIST Dataset - 16 Filters and 32 Dense Units
5.9  Neural Jet Feature Accuracy on the MNIST Dataset - 32 Filters and 128 Dense Units
5.10 Neural Jet Feature Accuracy on the MNIST Dataset - 64 Filters and 256 Dense Units


CHAPTER 1. INTRODUCTION

Image classification is one of the most popular computer vision tasks. Many industries, like

agriculture, medicine, and manufacturing, are adopting automatic image classification computer

systems to perform visual inspection. In the past, these industries have relied on humans to sort

objects and images. By employing computer systems to sort images, visual inspection can be

performed faster, cheaper, and in environments where human visual inspection cannot be used.

Image classification is usually computationally expensive and requires large amounts of memory.

In this work, we explore ways in which binary values and bitwise operations can be used in place

of fixed- and floating-point values and operations to reduce the computational costs and memory

requirements of image classification systems.

We targeted embedded computer systems and Field Programmable Gate Arrays (FPGAs)

as low powered, compact platforms for high speed image classification. These small form-factor

platforms can be installed in locations where bulky GPU/CPU systems cannot. They achieve high

processing speeds and require less power, but lack the abundance of memory and computational

resources that are afforded by full-sized GPU/CPU systems. By using binary digits, only a sin-

gle bit is required to represent each value, either +1 or −1. These binary values replace higher

precision floating- and fixed-point values. This greatly simplifies the algorithm’s arithmetic opera-

tions and reduces the size of the classification model. Simplified arithmetic and bitwise operations

are especially well suited for FPGAs, which manipulate signals at the bit level. These operations

are also well suited for embedded CPUs where large amounts of floating-point calculations prove

cumbersome.

This work explores the use of binary values in both traditional and modern computer vi-

sion techniques, combining them to take advantage of the elegance of traditional systems and the

convenience and power of modern Deep Learning (DL) methods.


1.1 Image Classification

Image classification algorithms begin by processing an image pixel by pixel. From these

pixel level computations, higher levels of abstractions are computed. Algorithms usually require

multiple levels of abstraction to be processed before an input image can be classified, which can

require large amounts of memory. Figure 1.1 illustrates how pixel and local level processing is

computationally expensive and the global classification towards the end of the model is memory

intensive. This applies to both traditional and DL algorithms. Traditional image classification

systems have two distinct parts, a computationally intensive image processing front end and a

memory-intensive classification model on the back end. DL models have a single model that

transitions from low-level computations to memory-intensive fully connected layers at the back end.

Both of these concepts are shown in Figure 1.1.

Figure 1.1: Image classification tasks generally require large amounts of memory and computational resources. In general, the first part of image classification algorithms is computationally intensive and the latter parts are memory intensive. Traditional image classification algorithms split these two parts into distinct processes, with image processing followed by a classification model. Deep Learning models blend these parts, with computationally intensive early convolutional layers followed by memory-intensive fully connected layers.


Many mainstream models target general image classification tasks with thousands of pos-

sible classes to choose from [1]. In this work, we focus on visual inspection tasks, rather than

general classification. Visual inspection tasks only need to cover a few different classes while

general classification can classify images into thousands of different classes. Automatic visual

inspection, used in fields like manufacturing and agriculture, requires high-speed image classifi-

cation with fixed camera conditions and a limited number of possible image classifications. Our

work is directed towards these types of applications where speed, power, and size are important

factors while the images being classified have less variance than the diverse set of images that are

often found in most state-of-the-art classification datasets.

We combined the simplicity of traditional image classification algorithms with the con-

venience and power of Machine Learning (ML) and Convolutional Neural Networks (CNN). We

targeted an intersection between these two paradigms through the use of binary values and bitwise

operations instead of floating-point or fixed-point values and operations.

1.1.1 Traditional Image Classification

Traditionally, image classification algorithms begin by using image processing and com-

puter vision to extract key features from an image. These features are fed into ML models that

organize and make sense of these features. Feature descriptors such as SIFT [2], SURF [3],

and ORB [4] have been used in conjunction with support vector machines (SVMs) [5], decision

trees [6], and bag of visual words (BoVWs) models [7].

Discrete convolutions between input images and filter kernels are at the heart of the image

processing used in image classification. Convolution kernels are used for a variety of tasks, such

as determining the scale and orientation of image features and reducing noise in an image. Con-

volution operations, which require many multiplication and addition operations, usually comprise

the majority of the computations within the algorithm.
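As a rough illustration of this cost (our own sketch, not code from the dissertation), a direct 2D convolution performs about k × k multiply-adds per output pixel, so a 5x5 kernel over a 640x480 image already costs roughly 7.6 million multiply-adds:

import numpy as np

def conv2d_valid(image, kernel):
    # Direct ("valid") 2D convolution: every output pixel costs k*k multiplies and adds.
    kh, kw = kernel.shape
    ih, iw = image.shape
    flipped = kernel[::-1, ::-1]  # true convolution flips the kernel
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * flipped)
    return out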

ML models, used to perform the actual classification, are usually trained from a large set

of example images. These models either require that examples are stored away for reference [8] or

they aggregate the examples together to form a model that can generalize the learned information

well [9]. The classification models generally require the majority of the memory used by the

algorithm.


1.1.2 Convolutional Neural Networks

While traditional image classification pipelines separate image processing and classifica-

tion models into distinct segments, convolutional neural networks (CNNs) use a single unified

model that performs both tasks [10]. In CNNs, both the classification model and convolution ker-

nels are learned simultaneously during training. CNNs are the current state-of-the-art for image

classification tasks and have achieved much higher accuracies than traditional methods on most

tasks.

LeCun et al. first proposed CNNs in 1998 with the LeNet model [10] which was capable of

classifying small images of handwritten single digit numbers. It wasn’t until 2012 that CNNs began

to be widely accepted as the state-of-the-art when the AlexNet model [11] won the ImageNet com-

petition [1] which involved 1000 different possible classifications of large photographed images.

Since then, major improvements have been made to CNNs. AlexNet has since been surpassed by

other CNN models like VGG [12], ResNet [13], DenseNet [14] and ResNext [15].

Most CNNs for classification are composed of a series of convolutional layers followed by

fully connected layers. These convolutional layers are composed of many more convolution op-

erations than traditional image processing techniques, and like the image processing in traditional

algorithms, these convolutions require the majority of operations in the model. Fully connected

layers consist of large matrix multiplications. These matrices require the majority of the model’s

memory. The convolution kernels and matrices used in the fully connected layers are both learned

by the deep learning algorithm during training. CNNs are known to require more operations and

memory than traditional image classification methods.
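As a back-of-the-envelope illustration (ours, with made-up layer sizes), counting multiply-adds and weights for a small CNN shows the convolutional layers dominating the operation count while a fully connected layer dominates the weight count:

def conv_layer_cost(h, w, c_in, c_out, k):
    # multiply-adds and weights for one k x k convolutional layer (stride 1, same padding)
    return h * w * c_in * c_out * k * k, c_in * c_out * k * k

def fc_layer_cost(n_in, n_out):
    # a fully connected layer does one multiply-add per weight
    return n_in * n_out, n_in * n_out

# illustrative topology: 32x32 RGB input, two conv layers, one dense layer
print(conv_layer_cost(32, 32, 3, 64, 3))    # ~1.8M ops,   1,728 weights
print(conv_layer_cost(16, 16, 64, 128, 3))  # ~18.9M ops, 73,728 weights
print(fc_layer_cost(8 * 8 * 128, 512))      # ~4.2M ops,  ~4.2M weights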

1.2 Overview

1.2.1 ECO Jet Features

Chapter 2 details our efforts to use binary values to reduce the computational cost of an

image classification system targeted at a traditional image classification algorithm, the Evolution

Constructed Features (ECO Features) [16] algorithm. This allowed the algorithm to be easily

implemented on an FPGA for low-power, high-speed image classification. We also showed that our


binary convolutions significantly improved the algorithm’s speed on embedded computer systems

and FPGAs, while maintaining classification accuracy.

The ECO Features algorithm uses a traditional image classification pipeline, beginning

with image processing operations followed by ML classification. ECO Features use a genetic

algorithm to construct image processing operations that are then fed into an ML model. The

image processing operations used by this algorithm are not always FPGA friendly nor conducive

to resource-constrained systems. In Chapter 2 we point out that the operations most commonly

selected by the genetic algorithm were convolution filters. We broke these commonly used filters

into a series of smaller convolutions that only used binary weights. We call the filters constructed

this way Jet features since they build up a set of scaled partial derivatives similar to n-jet sets [17].

These Jet features include or approximate many popular image filters such as the Gaussian filter,

Sobel filter, and Laplacian filter and can all be calculated simultaneously.

Jet features do not require any multiplication and can be easily implemented to run in

parallel with each other. With Jet features, the algorithm experiences a 3.7× speedup on CPUs

while maintaining the same level of accuracy. In addition, the operations in the algorithm became

simple enough to easily implement in an FPGA, which is not feasible using the original algorithm.

1.2.2 Binarized Neural Networks

In Chapter 3 we review Binarized Neural Networks (BNNs). BNNs are Neural Networks

that use binary values for model weights and neural activations. We outlined the major develop-

ments that have been made to BNNs. We reviewed and summarized the various techniques used to

improve their accuracy. We compared proposed BNN implementations on FPGAs and FPGA/CPU

co-processing systems in terms of accuracy, speed, power, and resources used.
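To sketch why binary weights and activations map so naturally onto bitwise hardware (an illustration of ours, not code from the review), a dot product of two {-1, +1} vectors packed into machine words reduces to an XNOR followed by a popcount:

def binary_dot(a_bits, w_bits, n):
    # a_bits, w_bits: n-bit integers; bit = 1 encodes +1 and bit = 0 encodes -1
    xnor = ~(a_bits ^ w_bits) & ((1 << n) - 1)  # bit is 1 wherever the two signs agree
    agree = bin(xnor).count("1")                # popcount
    return 2 * agree - n                        # (#agreements) - (#disagreements)

# example: a = [+1, -1, +1, +1], w = [+1, -1, -1, +1]  ->  dot product = 2
print(binary_dot(0b1011, 0b1001, 4))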

Normally, it is not possible to perform standard training using binary values in CNNs. The

backpropagation method requires the use of continuous values and functions. Initial efforts to

use binary values in CNNs used pre-trained full-precision models, then approximated these full

precision values using binary ones [18]. Chapter 3 outlines the basic methods that allow Neural

Networks to be trained using binary values. We summarized the major developments that have

been made throughout the literature to make BNNs more effective. We also reviewed the various

efforts to implement BNNs on FPGA and FPGA/CPU hybrid systems.


1.2.3 Scaling Down Binarized Neural Networks

Most implementations of BNNs throughout the literature require either large FPGAs or

FPGA/CPU hybrid systems. Chapter 4 looks at methods that were originally designed for full

precision neural networks and applies them to BNNs in an attempt to reduce their resource

requirements on FPGAs. Our experiments showed that these methods are not particularly well suited

for BNNs. They reduced the resource requirements of FPGA implementations but also hurt their

classification accuracy. These results motivated us to find techniques specifically tailored to BNNs

in order to reduce their size, which we present in Chapter 5.

1.2.4 Neural Jet Features

Chapter 5 introduces Neural Jet Features. Most BNN developments of the last few years

focused on making BNNs more accurate on large, complex datasets, like the ImageNet dataset [1].

These BNNs tended to be large and computationally expensive. Our goal in this dissertation was to use

binary values to perform high-speed image classification with limited computational resources, like

the ECO Jet Features algorithm. In order to accomplish this, we developed the Neural Jet Features

layer. This convolutional layer can be used to replace the standard BNN convolutional layer with

Jet Features that are learned through deep learning.

The Neural Jet Features layer learns classic computer vision kernels but is trained as a

single end-to-end system. The image features are trained at the same time as the fully connected

classifier, like BNNs, which is not possible in the ECO Features algorithm. This convolutional

layer requires fewer operations and weights than the standard BNN convolutional layer but main-

tains similar accuracy on visual inspection tasks.

1.3 Contributions

This dissertation focuses on reducing the computational and memory costs of image clas-

sification systems through the use of binary values. We developed Jet Features, a novel set of

convolutional kernels that are constructed from smaller binary-valued convolution operations. We

applied Jet Features to the existing ECO Features algorithm which reduced its computational and


memory costs and allowed it to be implemented on an FPGA while maintaining comparable clas-

sification accuracy.

BNNs are not new to this work, but we reviewed the existing literature with a special focus

on FPGA implementations. We highlighted the major developments in the field of BNNs and com-

pared the various implementations. We compiled a list of the various FPGA implementations, the

topologies they used, the techniques they employed, the platforms they targeted, and the resources

they required.

Neural Jet Features are unique to this work. They are a combination of our prior efforts

introduced in this work. They combine BNN techniques with Jet Features and allow BNNs to be

implemented in smaller FPGAs without sacrificing accuracy.


CHAPTER 2. JET FEATURES: HARDWARE-FRIENDLY, LEARNED CONVOLUTIONAL KERNELS FOR HIGH-SPEED IMAGE CLASSIFICATION

In this chapter, we present a set of learned convolutional kernels which we call Jet Fea-

tures. Jet Features are convolutional kernels that are formed from a series of convolutions of small

binary-valued convolution kernels. These small binary-valued kernels make Jet Features efficient

to compute in software, easy to implement in hardware, and perform well on visual inspection

tasks. Because Jet Features can be learned, they can be used in machine learning algorithms.

Using Jet Features we make significant improvements on previous work by Lillywhite et al., the

Evolution Constructed Features (ECO Features) algorithm [19]. We gained a 3.7x speedup in soft-

ware without losing any accuracy on the CIFAR-10 and MNIST datasets. Jet Features also allowed

us to implement the algorithm in a Field Programmable Gate Array (FPGA) using only a fraction

of its resources.

2.1 Introduction

The field of computer vision has come a long way in solving the problem of image classi-

fication. Not too long ago, handcrafted convolutional kernels were a staple of all computer vision

algorithms. With the advent of Convolutional Neural Networks (CNNs), handcrafted features have

become the exception rather than the rule, and for good reason. CNNs have taken the field of com-

puter vision to new heights by solving problems that used to be unapproachable or unthinkable.

With deep learning, convolutional kernels can be learned from patterns seen in the data rather than

pre-constructed by algorithm designers.

While CNNs are the most accurate solution to many computer vision tasks, they require

many parameters and many calculations to achieve such accuracy. In this work, we seek to speed

up image classification on simple tasks by leveraging some of the mathematical properties found

in classic handcrafted kernels and applying them in a procedural way with machine learning.


Convolutions with Jet Features are efficient to compute in both hardware and software.

They take advantage of redundant calculations during convolution operations and use only the

simplest kernels. We applied these features to our previous machine learning image classification

algorithm, the Evolution Constructed Features (ECO Features) algorithm. We call this new version

of the algorithm the Evolution Constructed Jet Features (ECO Jet Features) algorithm. It was accu-

rate on visual inspection tasks, and can be efficiently run on embedded computer devices without

the need for GPU acceleration. We specifically designed Jet Features to allow the algorithm to be

implemented on an FPGA, which will be referred to as our hardware implementation.

We tested a software implementation and a hardware implementation of our algorithm to

show the speed and compactness of the algorithm. Our hardware design is fully pipelined and

gives visual inspection results as soon as the image reaches the end of the data pipeline. This

hardware architecture was implemented on a Field Programmable Gate Array (FPGA), but could

be integrated into a system on a chip in custom silicon, where it could perform at an even faster

rate while using less power.

The original ECO Features algorithm [19] [16] has been used in various industrial applica-

tions. Many of these applications require high-speed visual inspection, where speed is important

but the identification task is fairly simple. In this work, we speed up the ECO Features algorithm,

allowing it to run 3.7 times faster in a software implementation while maintaining the same level

of accuracy on the MNIST and CIFAR-10 datasets. These improvements also made the algorithm

suitable for full parallelization and pipelining in hardware, which runs over 70 times faster in an

FPGA. The key innovations we present here are the development and use of Jet Features and the

development of a hardware architecture for our design.

2.2 Jet Features

Jet Features are convolutional kernels with special structures that allow for efficient con-

volutions. They are meaningful features in the visual domain and allow for elegant hardware

implementation. In fact, some of the most popular classical handcrafted convolutional kernels

qualify as Jet Features, like the Gaussian, Sobel, and Laplacian kernels. However, Jet Features are

not handcrafted, they are learned features that leverage some of the intuition behind these classic


kernels. Mathematically, they are related to multiscale local jets [17], which is reviewed in Section

2.2.2, but we introduce them here in a more informal manner.

2.2.1 Introduction to Jet Features

Jet Features are convolutional kernels that can be separated into a series of very small

kernels. In general, separable kernels are kernels that perform the same operation as a series of

convolutions with smaller kernels. Figure 2.1 shows an example of a 3x3 convolutional kernel that

can be separated into a series of convolutions with a 3x1 kernel and a 1x3 kernel. Jet Features take

separability to an extreme, being separated into the smallest meaningful sized kernels with only 2

elements. Specifically, all Jet Features can be separated into a series of convolutions with kernels

from the set {[1,1], [1,1]^T, [1,−1], [1,−1]^T}, which are also shown in Figure 2.2. We refer to

these small kernels as the Jet Feature building blocks. Two of these kernels, [1,1] and [1,1]^T, can

be seen as blurring factors or scaling factors. We will refer to them as scaling factors. The other

two kernels, [1,−1] and [1,−1]^T, apply a difference between pixels in either the x or y direction

and can be viewed as simple partial derivative operators. We will refer to them as partial derivative

operators. All Jet Features are a series of convolutions with any number of these basic building

blocks. With these building blocks, some of the most popular classic filters can be constructed.

In Figure 2.3 we show how the Gaussian and Sobel filters can be broken down into Jet Feature

building blocks.

Figure 2.1: An example of a separable filter. A 3x3 Gaussian kernel can be separated into two convolutions with smaller kernels.

It is important to note that the convolution operation is commutative and the order in which

the Jet Feature building blocks are applied does not matter. Therefore, every Jet Feature is defined

by the number of each building block it uses. For example, the 3x3 x-direction Sobel kernel can


Figure 2.2: The four basic kernels of all Jet Features. The top two kernels can be thought of as scaling or blurring factors. The bottom two perform derivatives in either the x- or y-direction. Every Jet Feature is simply a series of convolutions with any number of each of these kernels. The order does not matter.

Figure 2.3: These examples demonstrate how the Gaussian kernel and Sobel kernels are examples of Jet Features. These 3x3 kernels can be broken down into a series of four convolutions with the two-cell Jet Feature kernels. The Sobel kernels are similar to the Gaussian, but one of the scaling factors is replaced with a partial derivative.


be defined as 1 x-direction and 2 y-direction scaling factors and 1 x-direction partial derivative

operator (see Figure 2.3).
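The decomposition is easy to verify numerically. The short numpy sketch below (ours, for illustration) rebuilds the unnormalized 3x3 Gaussian and the x-direction Sobel kernel, up to sign convention, from the four building blocks:

import numpy as np

blur = np.array([1, 1])     # scaling factor
deriv = np.array([1, -1])   # partial derivative operator

# two scaling factors in each direction give the (unnormalized) 3x3 Gaussian kernel
gauss = np.outer(np.convolve(blur, blur), np.convolve(blur, blur))

# two y-direction scalings, one x-direction scaling, and one x derivative give the Sobel x kernel
sobel_x = np.outer(np.convolve(blur, blur), np.convolve(blur, deriv))

print(gauss)    # [[1 2 1], [2 4 2], [1 2 1]]
print(sobel_x)  # [[1 0 -1], [2 0 -2], [1 0 -1]]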

2.2.2 Multiscale Local Jets

We can more formally define a jet feature as an image transform that is selected from a

multiscale local jet. All features for the algorithm are selected from the same multiscale local jet.

Multiscale local jets were proposed by Florack et al. [17] as useful image representations that could

capture both the scale and spatial information within an image. They have proven to be useful for

various computer vision tasks such as feature matching, feature tracking, image classification, and

image compression [20] [21] [22]. Manzanera constructed a single unified system for several of

these tasks using multiscale local jets and attributed its effectiveness to the fact that many other

features are implicitly contained within a multiscale local jet [22]. Some of these popular features

include the Gaussian blur, the Sobel operator, and the Laplacian filter.

Multiscale local jets are a set of partial derivatives of a scale space of a function. Members

of a multiscale local jet have been previously defined in [17], [20], and [22] as

L_{x^m y^n σ}(A) = A ∗ δ_{x^m y^n} G_σ ,    (2.1)

where A is an input image, δ_{x^m y^n} is a differential operator of degree m with respect to x and

degree n with respect to y, and G_σ is the Gaussian operator with a variance of σ. A multiscale local

jet is the set of outputs L_{x^m y^n σ}(A) for a given range of values for m, n, and σ:

Λ_{x^a y^b [σ_c, σ_d]}(A) = { L_{x^0 y^0 σ_c}(A), ..., L_{x^a y^b σ_d}(A) },    (2.2)

taken over all combinations of m ∈ [0, a], n ∈ [0, b], and σ ∈ [σ_c, σ_d].

2.3 The ECO Features Algorithm

The ECO Features algorithm was originally developed in [19] and [16]. Its main purpose

was to automatically construct good image features that could be used for classification. This elim-

inated the need for human experts to hand-craft features for specific applications. This algorithm was

developed as CNNs were gaining popularity, which solved similar problems [11]. We recognize


that CNNs are able to achieve better accuracy than the ECO Features algorithm in most tasks,

but ECO Features are smaller and generally less computationally expensive. In this chapter we

are interested in the effectiveness of Jet Features in the ECO Features algorithm. The impact of

Jet Features is fairly straightforward to explore when working with the ECO Features algorithm.

Exploration of Jet Features in CNNs is left for future work.

An ECO Feature is a series of image transforms performed back to back on an input image.

Figure 2.4 shows an example of a hypothetical ECO Feature. Each transform in the feature can

have a number of parameters that change the effects of the transform. The algorithm starts with

a predetermined pool of transforms which are selected by the user. Table 2.1 shows the pool of

transforms used in [16].

Figure 2.4: An example of a hypothetical ECO Feature made up of three transforms. The top boxes represent the type of each transform. The boxes below show each transform's associated parameters. The number of transforms, transform types, and parameters of each transform are randomly initialized and then evolved through a genetic algorithm.

The genetic algorithm initially forms ECO Features by selecting a random series of trans-

forms and randomly setting each of their parameters. The parameters of each transform are modi-

fied through the process of mutation in the genetic algorithm. New orderings of the transforms are

also created as pairs of ECO Features are joined together through genetic crossover, where the first

part of one series is spliced with the latter portion of a different series. A graphical representation

of mutation and crossover is shown in Figure 2.5.

Each ECO Feature is paired with a classifier. An example is given in Figure 2.6. Originally,

single perceptrons were used as the classifiers for each ECO Feature. Since perceptrons are only

capable of binary classification, we seek to extend the capabilities of the algorithm to perform

multiclass classification. We replaced the perceptrons with random forest [23] classifiers in this

work. Inputs are fed through the ECO Feature transforms and the outputs are fed into the classifier.

A holdout set of images is then used to evaluate the accuracy of each ECO Feature. This accuracy is


Table 2.1: The pool of possible image transforms used in the ECO Features Algorithm.

Transform          Parameters      Transform                 Parameters
Gabor filter       6               Sobel operator            4
Gradient           1               Difference of Gaussians   2
Square root        0               Morphological erode       1
Gaussian blur      1               Adaptive thresholding     3
Histogram          1               Hough lines               2
Hough circles      2               Fourier transform         1
Normalize          3               Histogram equalization    0
Log                0               Laplacian Edge            1
Median blur        1               Distance transform        2
Integral image     1               Morphological dilate      1
Canny edge         4               Harris corner strength    3
Rank transform     0               Census transform          0
Resize             1               Pixel statistics          2

Figure 2.5: An example of mutation (top) and crossover (bottom). Mutation will only change the parameters of a given ECO Feature. Crossover takes the first part of one feature and appends the latter part of another feature to it.


used as a fitness score when performing genetic selection in the genetic algorithm. ECO Features

with high fitness scores are propagated to future rounds of evolution while weak ECO Features die

off. The genetic algorithm continues until a single ECO Feature outperforms all others for a set

number of consecutive generations. This ECO Feature is selected and saved while all others are

discarded. This process is repeated N times where N is the number of desired ECO Features.

Figure 2.6: An example pairing of an ECO Feature with a random forest classifier. Every ECO Feature is paired with its own classifier. Originally, perceptrons were used, but in our work, random forests are used, which offer multiclass classification.

As the genetic algorithm selects ECO Features, they are combined to form an ensemble

using a boosting algorithm. We use the SAMME [24] variation of AdaBoost [25] for multiclass

classification. The boosting algorithm adjusts the weights of the dataset giving importance to

harder examples after each ECO Feature is created. This leads to ECO Features tailored to certain

aspects of the dataset. Once the desired number of ECO Features have been constructed, they are

combined into an ensemble. This ensemble predicts the class of new input images by passing the

image through all of the ECO Feature learners, letting each one vote for which class should be

predicted. Figure 2.7 depicts a complete ECO Features system.

Figure 2.7: A system diagram of the original ECO Features Algorithm. Each classifier has its own ECO Feature transform. The outputs of each classifier are collected into a weighted summation to determine the final prediction.
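A simplified sketch (ours, with the SAMME weight computation omitted) of the final voting step, assuming each ECO Feature's classifier has already produced a class prediction and a boosting weight:

import numpy as np

def ensemble_predict(predictions, weights, num_classes):
    # predictions: per-feature class indices; weights: per-feature weights from boosting
    scores = np.zeros(num_classes)
    for cls, w in zip(predictions, weights):
        scores[cls] += w            # each ECO Feature casts a weighted vote for its class
    return int(np.argmax(scores))   # the class with the highest weighted total wins

# example: three features vote for classes 2, 0, and 2 with weights 0.5, 1.2, and 0.8
print(ensemble_predict([2, 0, 2], [0.5, 1.2, 0.8], num_classes=3))  # -> 2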


Since the publications of [19] and [16], ECO Features was applied to the problem of visual

inspection where ECO Features were used to determine the maturity of date fruits [26]. This

algorithm has also been used in industry to automate visual inspection for other processes.

2.4 The ECO Jet Features Algorithm

In this section, we look at how Jet Features can be introduced into the ECO Features al-

gorithm. We call this modified version the ECO Jet Features algorithm. This modification sped

up performance while maintaining accuracy on simple image classification. It was specifically

designed to allow for easy implementation in hardware.

2.4.1 Jet Feature Selection

The ECO Jet Features algorithm uses a similar genetic algorithm to the one discussed in

Section 2.3. Instead of selecting image transforms from a pool and combining them into a series, it

simply uses a single Jet Feature. The amount of scaling and partial derivatives are the parameters

that are tuned through evolution. These four parameters are bounded from 0 to a set maximum,

forming the multsicale local jet, similar to equation (2.2). We found that bounding the partial

derivatives, δx,δy ∈ [0,2], and scaling factors, σx,σy ∈ [0,6], is effective at finding good features.

In order to accommodate the use of jet features, mutation and crossover are redefined. The

four parameters of the jet feature, δx, δy, σx, and σy, are treated like genes that make up the genome

of the feature. During mutation, the values of these individual parameters are altered. During

crossover, the genes of a child jet feature are each copied from either the father or the mother

genome. This selection is made randomly. This is illustrated in Figure 2.8.
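A minimal sketch (ours; the exact mutation distribution is an assumption) of how these operators act on the four-parameter genome:

import random

MAX_DELTA, MAX_SIGMA = 2, 6  # bounds used in this section

def mutate(genome):
    # re-draw one randomly chosen parameter within its bound
    dx, dy, sx, sy = genome
    gene = random.randrange(4)
    if gene == 0:   dx = random.randint(0, MAX_DELTA)
    elif gene == 1: dy = random.randint(0, MAX_DELTA)
    elif gene == 2: sx = random.randint(0, MAX_SIGMA)
    else:           sy = random.randint(0, MAX_SIGMA)
    return (dx, dy, sx, sy)

def crossover(father, mother):
    # each gene of the child is copied from one parent, chosen at random
    return tuple(random.choice(pair) for pair in zip(father, mother))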

2.4.2 Advantages in Software

The jet feature transformation can be calculated with a series of matrix shifts, additions,

and subtractions. Since the elements of the basis kernels for the transformations are either 1 or

-1, there is no need for general convolution with matrix multiplication. Instead, a jet transform

can be applied to image A by making a copy of A, shifting it in either the x or y directions by one

pixel and then adding or subtracting it with the original. Padding is not used. Using jet transforms,


Figure 2.8: How mutation (top) and crossover (bottom) are defined for Jet Features.

there is no need for multiplication or division operations. We do recognize that normalization is

normally used with traditional kernels. However, since this normalization is applied equally to all

elements of an input image and the output values are fed into a classifier, we argue that the only

difference normalization makes is to keep the intermediate values of the image representation

reasonably small. In practice, we see no improvement in accuracy by normalizing during the jet

feature transform.
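A minimal numpy sketch (ours) of this shift-and-add computation; the function names are illustrative, boundary pixels are simply dropped since no padding is used, and the sign convention of the differences is immaterial to the classifier:

import numpy as np

def scale_x(a): return a[:, :-1] + a[:, 1:]   # convolve with [1, 1] along rows
def scale_y(a): return a[:-1, :] + a[1:, :]   # convolve with [1, 1]^T along columns
def diff_x(a):  return a[:, 1:] - a[:, :-1]   # convolve with [1, -1] along rows
def diff_y(a):  return a[1:, :] - a[:-1, :]   # convolve with [1, -1]^T along columns

def jet_transform(img, dx, dy, sx, sy):
    # apply dx/dy partial derivatives and sx/sy scaling factors; only shifts, adds, and subtracts
    out = img.astype(np.int32)                # widen 8-bit pixels so intermediate sums cannot overflow
    for _ in range(sx): out = scale_x(out)
    for _ in range(sy): out = scale_y(out)
    for _ in range(dx): out = diff_x(out)
    for _ in range(dy): out = diff_y(out)
    return out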

Another property of Jet Features that allows for efficient computation is the fact that one

Jet Feature of a higher order can be calculated using the result of a Jet Feature of a lower order.

The outputs of the lowest order jet features can be used as an input to any other ECO Jet Feature

that has parameters of equal or greater value. Calculating all of the jet features in an ensemble

in the shortest amount of time can be seen as an optimization problem where the order in which

features are calculated is optimized to require the minimum number of operations. We explored

optimization strategies that would find the most efficient order of buffers and arithmetic units for

a given ensemble of jet features. We did not see much improvement when employing complex

scheduling strategies. The most effective and simple strategy was calculating features with the

lowest sum of δx,δy,σx, and σy first and working to higher ordered features, reusing lower ordered

outputs where we could.
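A simplified sketch (ours) of that strategy: sort the selected features by total order, keep a cache of intermediate results keyed by the genome, and grow each feature from a cached lower-order result by applying only the missing building blocks (the one-step helpers repeat the shift-and-add definitions from the previous sketch):

import numpy as np

# one-step building blocks keyed by genome position: (dx, dy, sx, sy)
STEPS = (
    lambda a: a[:, 1:] - a[:, :-1],   # x derivative
    lambda a: a[1:, :] - a[:-1, :],   # y derivative
    lambda a: a[:, :-1] + a[:, 1:],   # x scaling
    lambda a: a[:-1, :] + a[1:, :],   # y scaling
)

def compute_ensemble(img, genomes):
    # compute every selected jet feature, reusing lower-order results where possible
    cache = {(0, 0, 0, 0): img.astype(np.int32)}
    for g in sorted(genomes, key=sum):                 # lowest total order first
        params = list(g)
        while tuple(params) not in cache:              # back off to a cached lower-order ancestor
            params[max(range(4), key=lambda i: params[i])] -= 1
        out = cache[tuple(params)]
        for i in range(4):                             # apply only the missing building blocks
            for _ in range(g[i] - params[i]):
                out = STEPS[i](out)
        cache[g] = out
    return {g: cache[g] for g in genomes}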


2.4.3 Advantages in Hardware

Jet features were developed to make calculations in hardware simpler for our new algo-

rithm than the original ECO Features algorithm. The original ECO Features algorithm has several

attributes that make it difficult to implement in hardware. Similar to the advantages discussed in

Section 2.4.2, the jet features are even more advantageous in a hardware implementation.

First, the original algorithm forms features from a generic pool of image transforms. This

is relatively straightforward to implement in software when a computer vision library is available,

only requiring extra room in memory for the library calls. In hardware physical space in silicon

must be dedicated to units to perform each of these transforms. The jet feature transform utilizes a

set of simple operations that are reused in every single jet feature.

Second, the transforms of the original algorithm are not commutative. The order they are

executed affects the output. Intermediate calculations would need to have the ability to be routed

from every transform to every other transform. This kind of complexity could be tackled with a

central memory, a bus system, redundant transform modules, and/or a scheduler. The jet transform

is commutative and the order of convolutions does not matter. Routing intermediate calculations

becomes trivial.

Third, intermediate calculations from the original ECO Feature transformations can rarely

be used in any other ECO Feature. On the other hand, jet features are cumulative. Using this

property, the ECO Jet Features algorithm is easily pipelined and calculations for multiple features

can be calculated simultaneously. In fact, instead of scheduling the order in which features are

calculated, our architecture calculates every possible feature every time an input image is received.

This allows for easy reprogrammability for different applications. The feature outputs required for

that specific model are used and the others are ignored. Little extra hardware is required, and there

is no need for a dynamic control unit.

Fourth, calculating jet features in hardware requires only addition and subtraction operators

in conjunction with pixel buffers. The transforms of the original ECO Features algorithm require

multiplication, division, procedural algorithm control, logarithm operators, square root operators

and more to implement all of the transforms available to the algorithm. In hardware, these opera-

tions can require large spaces of silicon and can generate bottlenecks in the pipeline. As mentioned

in Section 2.4.2, the Gaussian blur does require a division by two when normalizing. However,


with a fixed base-two number system, this does not require any extra hardware. It is merely a

shift of the implied binary point.

2.5 Hardware Architecture

The ECO Jet Features hardware architecture consists of two major parts: a jet feature unit

and a classifier unit. A simple routing module connects the two, as shown in Figure 2.9. Input

image data is fed into the jet features unit as a series of pixels, one pixel at a time. This type

of serial output is common for image sensors, but we acknowledge that if the ECO Jet Features

algorithm was embedded close to the image sensor, other more efficient types of data transfer

would be possible. As the data is piped through the jet features unit, every possible jet feature

transform is calculated. Only the features that are relevant to the specific loaded model are then

routed to the classifier unit. The classifier unit contains a random forest for every ECO Jet Feature

in the model and the appropriate output from the jet features unit is processed by the corresponding

random forest.

Figure 2.9: System diagram for the ECO Jet architecture. The Jet Features Unit computes every feature for a given multiscale local jet on an input image. A router connects only the ones that were selected during training to a corresponding random forest. The forests each vote on a final prediction.

2.5.1 The Jet Features Unit

The jet features unit calculates every feature for a given multiscale local jet. An input image

is fed into the unit one pixel at a time, in row-major order. As pixels are piped through the unit, it produces

multiple streams of pixels, one stream for every feature in the jet.


All convolutions in jet feature transforms require the addition or subtraction of two pixels.

This is accomplished by feeding the pixels into a buffer, where the incoming pixel is added or

subtracted from the pixel at the end of the buffer, as shown in Figure 2.10. Convolutions in the x

direction (along the rows) require only a single pixel to be buffered due to the fact that the image

sensor transmits pixels in row-major order. Convolutions in the y direction, however, require pixel

buffers to be the width of the input image. A pixel must wait until a whole row of pixels is read in

for its neighboring pixel to be fed into the system.
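The buffering requirement can be modeled in a few lines of Python (our software model of the hardware unit; real hardware must also handle row boundaries, which this sketch ignores):

from collections import deque

class StreamConvUnit:
    # models one convolution unit: pixels arrive one at a time in row-major order
    def __init__(self, image_width, direction, subtract=False):
        depth = 1 if direction == "x" else image_width  # x needs one pixel, y needs a whole row
        self.buffer = deque(maxlen=depth)
        self.subtract = subtract

    def push(self, pixel):
        # combine the incoming pixel with the pixel at the end of the buffer, once it is full
        out = None
        if len(self.buffer) == self.buffer.maxlen:
            oldest = self.buffer[0]
            out = pixel - oldest if self.subtract else pixel + oldest
        self.buffer.append(pixel)
        return out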

Figure 2.10: Convolution units. The top unit shows only a single buffer needed for convolution along rows in the x direction. The bottom unit shows a large buffer used for convolution along the columns in the y direction.

With units for convolution in both the x and y directions, an array of convolutional units is

connected to produce every jet feature for a given multiscale local jet. By restricting the multiscale

local jet, there are fewer possible jet features. In order to see the effect of restricting the maximum

allowed values for δ and σ, we tested various configurations on the BYU Fish Species dataset, which is further explained in Section 2.6.1. We restricted both δ and σ to a maximum of 15, 10, and 5. Each of these configurations was trained and tested. We observed

that the genetic algorithm often selected either 0 or 1 as values for δ and so a configuration where

δ ≤ 1 and σ ≤ 5 was tested as well. Figure 2.11 shows the average test accuracy for each of


these configurations as the model is being trained. From these results, we feel confident that

restricting δ and σ does not hurt the algorithm’s accuracy significantly. It does, however, restrict

the feature space significantly, which allows for a much more compact hardware design. In our hardware

architecture, we restrict δ ≤ 1 and σ ≤ 5 and only have 144 different possible jet features.

[Plot: error rate vs. number of jet features for the max 15, max 10, max 5, and max 5/max 1 configurations.]

Figure 2.11: Comparison of multiscale local jets. The maximum allowable variance in Gaussian blur and order of differentiation was limited to 15, 10, and 5. A fourth case was tested where the variance in Gaussian blur was bounded by 5 and the order of differentiation was bounded by 1.

In order to calculate every jet feature using the least amount of hardware, we perform

convolutions in the y direction first. Since these convolutions require whole line buffers, it is best

to minimize the required number of these units. Each of these convolutions is then fed into similar

modules that perform the convolutions in the x-direction, each producing its own 12 jet features

for a total of 144 different features.


2.5.2 The Random Forest Unit

A random forest is a collection of decision trees. Each tree votes individually for a partic-

ular class based on the values of specific incoming pixels. Nodes of these trees act like gates that

pass their input to one of two outputs. Each node is associated with a specific pixel location and

value. If the actual value of that pixel is less than the value associated with the node, the “left”

output is opened. If the actual value is greater, the “right” output is opened.

Every random forest receives the location information for every incoming pixel. Each tree

contains a lookup table for storing comparison values of each pixel and which pixel location is

associated with it. Entries are stored in the order in which they will be read into the tree, in row-

major order. A pointer is used to keep track of which entry will be seen next by the tree. Once the

target pixel arrives, the pointer moves to the next entry. The sorting of these node values is done

by the host computer system and it is assumed they are loaded in the proper order.

Once there is an open path from the root node of a tree to a leaf node, the tree sends

its prediction, stored at the leaf node. Once all trees in a forest have sent their predictions, the

prediction with the most votes is sent to the tabulating unit. Once all forests have finished voting,

their votes are combined with their weights learned from the AdaBoost process. The classification

with the highest score is selected as the final prediction.
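As a behavioral illustration only (this is not the hardware itself, and the node layout and names are our own assumptions), the following Python sketch shows the voting scheme: each tree compares selected pixel values against stored thresholds, each forest takes a majority vote, and the forest votes are combined with their AdaBoost weights.

```python
import numpy as np

def tree_predict(nodes, leaves, pixels):
    # nodes[i] = (pixel_index, threshold, left_child, right_child); a negative
    # child value -k - 1 points at leaves[k], which stores a class prediction.
    i = 0
    while i >= 0:
        pixel_index, threshold, left, right = nodes[i]
        i = left if pixels[pixel_index] < threshold else right
    return leaves[-i - 1]

def forest_predict(trees, pixels, num_classes):
    # Each tree casts one vote; the forest outputs the majority class.
    votes = np.zeros(num_classes)
    for nodes, leaves in trees:
        votes[tree_predict(nodes, leaves, pixels)] += 1
    return int(np.argmax(votes))

def classify(forests, adaboost_weights, feature_images, num_classes):
    # feature_images[f] holds the jet feature output routed to forest f.
    scores = np.zeros(num_classes)
    for trees, weight, pixels in zip(forests, adaboost_weights, feature_images):
        scores[forest_predict(trees, pixels, num_classes)] += weight
    return int(np.argmax(scores))
```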

In order to select an efficient configuration, we experimented with different sizes of random

forests and different numbers of jet features. We varied the number of trees in a forest, the maxi-

mum depth of each forest, and the total number of creatures, which corresponds with the number

of forests. Figure 2.12 shows the accuracy of these different configurations compared with the total

count of nodes in their forests. A pattern of diminishing returns is apparent as models grow to be

more than 3,000 nodes. The models that performed best were ones with at least 10 creatures and

a balance between tree count and tree depth. We used a configuration of 10 features with random

forests of 5 trees, each 5 levels deep. This setup requires 3,150 nodes.


[Plot: test accuracy vs. total number of nodes in the random forests.]

Figure 2.12: Comparison of test accuracy to the total number of nodes in the random forests. Various numbers of trees, forests, and tree depths were tested.

2.6 Results

2.6.1 Datasets

ECO Features was designed to solve visual inspection applications. These applications

typically involve fixed camera conditions where the objects being inspected are similar. This in-

cludes manufactured goods that are being inspected for defects or agricultural produce that is being

graded for size or quality. These applications usually are fairly specific and real-world users do not

have extremely large datasets.

We first explored the accuracy of ECO Features and ECO Jet Features on the MNIST and

CIFAR-10 datasets. Both are common datasets used in deep learning with small images with only

10 different classes in each dataset. The MNIST dataset consists of 70,000 28x28 pixel images

(10,000 reserved for testing) and the CIFAR-10 dataset consists of 60,000 32x32 pixel sized images


(10,000 reserved for testing). The MNIST dataset features handwritten numerical examples and

the CIFAR-10 images each consist of various objects. Examples are shown in Figure 2.13.

Figure 2.13: Examples from the MNIST dataset (top) with handwritten digits. Examples from the CIFAR-10 dataset (bottom). Classes include airplane, bird, car, cat, deer, dog, frog, horse, ship, and truck.

We also tested our algorithms on a dataset that is more typical for visual inspection tasks.

MNIST and CIFAR-10 contain many more images than what is typically available to users solv-

ing specific visual inspection tasks. Visual inspection training sets also include less variation in

object type and camera conditions than in the CIFAR-10 dataset. The MNIST and CIFAR-10

datasets consist of small images, which makes execution time atypically fast for visual inspection

applications. For these reasons we also used the BYU Fish dataset in our experimentation.

The BYU Fish dataset consists of images of fish from eight different species. The images

are 161 pixels wide by 46 pixels tall. We split the dataset to include 778 training images and

254 test images. Images were converted to grayscale before being passed to the algorithm. Each

specimen is oriented in the same way and the camera pose remains constant. This type of dataset

is typical for visual inspection systems where camera conditions are fixed and a relatively small

number of examples are available. Examples are shown in Figure 2.14.


Figure 2.14: Examples from the BYU Fish dataset. Each image is of a different fish species.

2.6.2 Accuracy on MNIST and CIFAR-10

To get a feel for how Jet Features changes the capacity of ECO Features to learn, we trained

the ECO Features algorithm and the ECO Jet Features algorithm on the MNIST and CIFAR-10

datasets. These datasets have many images and were specifically designed for deep learning algo-

rithms which can take advantage of such a large training set. We note that the absolute accuracy

of these models did not compare well with state-of-the-art deep learning models, but we use these

larger datasets to fully test the capacity of our ECO Jet Features in comparison to the original ECO

Features algorithm.

Each model was trained with random forests of 15 trees up to 15 levels deep. When testing

on CIFAR-10, each model trained 200 creatures. The accuracy as features were added is shown in

Figure 2.15. The models were only trained to 100 creatures on MNIST where the models seem to

converge, as shown in Figure 2.16. The CIFAR-10 results show that the models converge to similar

accuracy while ECO Jet Features show a slight improvement (about 0.3%) over the original algorithm on

MNIST. From these results, we conclude that Jet Features introduce no noticeable loss in accuracy.

2.6.3 Accuracy on BYU Fish dataset

We also trained on the BYU Fish dataset with the same experimental setup that was used

on the other datasets. The results are plotted in Figure 2.17. While the models do seem to converge to


Figure 2.15: Accuracy comparison between the original ECO Features algorithm and the ECO Jet Features algorithm on CIFAR-10. Once the models seem to converge, there is no evidence of lost accuracy in the ECO Jet Features model.

similar accuracy, results from training using such a small dataset may not be quite as convincing

as those obtained using larger datasets. These results are included for completeness, since this dataset was used in our procedure mainly for testing speed, efficiency, and model size.

2.6.4 Software Speed Comparison

While the primary objective of the new algorithm was to be hardware friendly, it was inter-

esting to explore the speedup gained in software. Each algorithm was implemented on a full-sized

desktop PC running a Skylake i7 Intel processor, using the OpenCV library. OpenCV contains

built-in functionality for most of the transforms from the original ECO Features algorithm. It also

provides vectorization for Jet Feature operations.


Figure 2.16: Accuracy comparison between the original ECO Features algorithm and the ECO Jet Features algorithm on MNIST. Once the models seem to converge, ECO Jet Features seem to have a slight edge in accuracy, about 0.3%.

We attempted to accelerate these algorithms using GPUs but found this was only possible

on images that were larger than 1024x768. Even using images that were this large did not provide

much acceleration. The low computational cost of the algorithm does not justify the overhead of

CPU to GPU data transfer.

A model of 30 features was created for both algorithms. The BYU Fish dataset was used because the

image sizes are more typical of real-world applications. The original algorithm averaged a run

time of 10.95ms and our new ECO Jet Features algorithm averaged an execution time of 2.95ms,

which is a 3.7x speedup.


Figure 2.17: Accuracy comparison between the original ECO Features algorithm and the ECO Jet Features algorithm on the BYU Fish dataset.

Table 2.2: ECO Jet Features architecture total hardware usage on a Kintex-7 325 FPGA using 10 creatures, 5 trees at a depth of 5.

Resource          Number Used    Percent of Available
Total Slices      10,868         4.9%
Total LUTs        34,552         4.9%
LUTs as Logic     31,644         4.4%
LUTs as Memory    2,908          1.0%
Flip Flops        17,132         1.2%
BRAMs             0              0%
DSPs              0              0%


Table 2.3: The hardware usage for individual design units.

Unit                        Slices   Total LUTs   LUTs as Logic   LUTs as Memory   Flip Flops
Jet Features Unit           7,593    23,253       22,377          876              11,411
Random Forests Unit         3,741    11,080       9,080           2,000            5,080
Individual Random Forests   374      1,108        908             200              508
Individual Decision Trees   73       210          171             40               93
Feature Router              49       40           40              0                520
AdaBoost Tabulation         61       180          148             32               121

2.6.5 Hardware Implementation Results

Our hardware architecture was designed in SystemVerilog. It was synthesized and imple-

mented for a Xilinx Virtex-7 FPGA using the Vivado design suite. From our analysis reported

in Section 2.5, Figures 2.11 and 2.12, we implemented a model with 10 features, 5 trees in each

forest with a depth of 5, a maximum σ of 5, and maximum δ of 1. We used a round number of 100

pixels for the input image width. A model built around the BYU Fish dataset would have required

only 46 pixels in its line buffers, but this length is small due to the oblong nature of fish. We feel a

width of 100 pixels is more representative of general visual inspection tasks.

The total utilization of available resources on the Xilinx Virtex-7 is reported in Table 2.2.

One interesting point is that this architecture requires no Block RAM (BRAM) or Digital Signal

Processing (DSP) units. BRAMs are dedicated RAM blocks that are built into the FPGA fabric.

DSPs are generally used for more complex arithmetic operations, like general convolution. Our

architecture, however, was compact enough and simple enough to not require either of these re-

sources and instead host all of its logic in the main FPGA fabric. Look Up Tables (LUTs) make

up the majority of the fabric area and are used to store all data and perform logic operations.

To give a quick reference of FPGA utilization for a CNN on a similar Virtex-7 FPGA,

Prost-Boucle et al. [27] reported using 22% to 74.4% of the 52.6 Mb of total BRAM memory for

various sizes of the model. Our model did not require the use of any of these BRAM memory

units. When comparing the number of LUTs used as logic, Prost-Boucle et al. used 112% more

than our model in their smallest model and 769% more in their larger, more accurate model.

The pixel clock speed can safely reach 200MHz. Since the design was fully pipelined

around the pixel clock, images from the BYU Fish dataset could, in theory, be processed in 37µs.


This is a 78.3x speedup over the software implementation on a full-sized desktop PC. A custom

silicon design could be even faster than this FPGA implementation.

Table 2.3 shows the relative sizes for individual units of the design. Some FPGA logic

slices are shared between units and the sum of the individual unit counts exceeds the totals listed in

Table 2.2. With a setup of 10 creatures, 5 trees per forest with a depth of 5, the Jet Features Unit makes

up about 70% of the total design. However, since this unit is generating every jet in the multiscale

local jet, it does not grow as more features are added to the model. We showed in Figure 2.11 that

using a large local jet does not necessarily improve performance.

The Random Forest unit makes up less than 35% of the design in all aspects other than LUTs used as memory, which are a subset of the total LUTs. However, only 10 features

were used and more could be added to increase accuracy as shown in Figure 2.17. Extrapolating out

from these numbers, if all 144 possible features were added to this design, only 30% of resources

available to the Virtex-7 would be used, and 87.9% of them would be dedicated to the Random Forests

Unit.

These results show how compact this architecture is. The simple operations and feedfor-

ward paths used in this design could very feasibly be implemented in custom silicon as well.

2.7 Smart Camera System

We built a compact smart camera for automated visual inspection to demonstrate how ECO

Jet Features can be used in an industrial setting. The camera performed real-time image classifica-

tion using ECO Jet Features. It sent signals to sorting mechanisms, indicating when they should activate to sort objects according to their classification.

2.7.1 System Overview

The smart camera system consisted of three major parts: an Nvidia TX1 embedded GPU/CPU,

an Arduino Uno microcontroller, and a FLIR Grasshopper high-speed camera, as shown in Figure

2.18.

Our original smart camera design targeted a GPU-powered algorithm which is why it fea-

tures an Nvidia TX1. The TX1 is a small GPU/CPU system designed for embedded platforms.


Figure 2.18: System diagram of the smart camera system.

ECO Jet Features is efficient enough to not require a GPU. Instead, the ARM Cortex A57 on the

TX1 is enough to run ECO Jet Features image classification in real time. The CPU classifies the

incoming images from the camera and sends the results to the microcontroller.

The Arduino Uno microcontroller was used to trigger the sorting mechanisms. It kept a map of all incoming objects: where they were, which sorting mechanisms needed to be triggered, when they needed to be triggered, and for how long. In our demonstration, the Arduino was connected

to pneumatic valves that shot air at the objects in order to blow them off a conveyor belt and into

sorting bins. The Arduino was connected to the conveyor belt in order to monitor the speed of the

belt. This allowed the speed of the belt to be adjusted.

A FLIR Grasshopper high-speed camera was used to capture images of the objects passing

under the camera on a conveyor belt. Dedicated I/O pins on the camera were connected to the


Arduino to trigger the camera in real time. The Arduino tracked the speed of the conveyor belt and

triggered the camera when a new section of the belt entered the camera frame.

2.7.2 Software

We developed a custom software GUI to control, configure and train the smart camera

system. The software automatically calibrated the microcontroller to determine how fast objects

pass under the camera. It could be used to capture a set of training data to train the ECO Features

algorithm by passing example objects under the camera and later labeling them. It could control

the camera’s settings like aperture, focus, and field of view.

2.8 Conclusion

We have presented Jet Features, learned convolutional kernels that are efficient in both

software and hardware implementations. We applied them to the ECO Features algorithm. This

change to the algorithm allows faster software execution and hardware implementation. In soft-

ware, the algorithm gained a 3.7x speedup with no noticeable loss in accuracy. We also presented

a compact hardware architecture for our new algorithm that is fully pipelined and parallel. On an

FPGA, this architecture can process images in 37µs, a 78.3x speedup over the improved software

implementation.

Jet Features are related to the idea of multiscale local jets. Large groups of these transforms

can be calculated in parallel. They incorporate many other common image transforms such as the

Gaussian blur, Sobel edge detector, and Laplacian transform. The simple operators required to

calculate jet features allow them to be easily implemented in hardware in a completely pipelined

and parallel fashion.

With a compact classification architecture for visual inspection, automatic visual inspec-

tion logic can be embedded into image sensors and compact hardware systems. Visual inspection

systems can be made smaller, cheaper, and available to a wider range of visual inspection applica-

tions.


CHAPTER 3. A REVIEW OF BINARIZED NEURAL NETWORKS

In this chapter, we review an existing use of binary values for image classification called

Binarized Neural Networks (BNNs). BNNs are deep neural networks that use binary values for

activations and weights, instead of full precision values. With binary values, BNNs can execute

computations using bitwise operations, which reduces execution time. Model sizes of BNNs are

much smaller than their full precision counterparts. While the accuracy of a BNN model is gener-

ally less than full precision models, BNNs have been closing the accuracy gap and are becoming

more accurate on larger datasets like ImageNet. BNNs are also good candidates for deep learning

implementations on FPGAs and ASICs due to their bitwise efficiency. We give a tutorial of the

general BNN methodology and review various contributions, implementations, and applications of

BNNs.

3.1 Introduction

Deep neural networks (DNNs) are becoming more powerful. However, as DNN models

become larger they require more storage and computational power. Edge devices in IoT systems,

small mobile devices, power-constrained, and resource-constrained platforms all have constraints

that restrict the use of cutting-edge DNNs. Various solutions have been proposed to help solve

this problem. Binarized Neural Networks (BNNs) are one solution that tries to reduce the memory

and computational requirements of DNNs while still offering similar capabilities of full precision

DNN models.

There are various types of networks that use binary values. In this chapter, we focus on

networks based on the BNN methodology first proposed by Courbariaux et al. in [28] where

both weights and activations only use binary values, and these binary values are used during both

inference and backpropagation training. From this original idea, various works have explored


how to improve their accuracy and how to implement them in low-power and resource-constrained

platforms.

Almost all work on BNNs has focused on advantages that are gained during inference time,

rather than training time. Unless otherwise stated, when the advantages of BNNs are mentioned in

this chapter, we will assume these advantages apply mainly to inference. However, we will look at

the advantages of BNNs during training as well. Since BNNs have received substantial attention

from the digital design community, we also focus on various implementations of BNNs on FPGAs.

BNNs build upon previous methods for quantizing and binarizing neural networks, which

are reviewed in Section 3.3. Since terminology throughout the BNN literature may be confusing

or ambiguous, we review important terms in Section 3.2. We outline the basic mechanics of BNNs

in Section 3.4. Section 3.5 details the major contributions to the original BNN methodology.

Techniques for improving accuracy and execution time at inference are covered in Section 3.6. We

present accuracies of various BNN implementations on different datasets in Section 3.7. FPGA

and ASIC implementations are highlighted in Sections 3.8.1 and 3.8.5.

3.2 Terminology

Before diving into the details of BNNs and how they work, we want to clarify some of the

terminology that will be used throughout this review. Some of the terms are used interchangeably in the literature and can be ambiguous.

Weights: Learned values that are used in a dot product with activation values from previous

layers. In BNNs, there are real-valued weights that are learned and binary versions of those weights

which are used in the dot product with binary activations.

Activations: The outputs from an activation function that are used in a dot product with the

weights from the next layer. Sometimes the term “input” is used instead of activation. We use the

term “input” to refer to input to the network itself and not just the inputs to an individual layer. In

BNNs, the output of the activation function is a binary value and the activation function is the sign

function.

Dot product: A multiply-accumulate operation that occurs in the “neurons” of a neural network.

The term “multiply-accumulate” is used at times in the literature, but we use the term dot product

instead.


Parameters: All values that are learned by the network through backpropagation. This

includes weights, biases, gains, and other values.

Bias: An additive scalar value that is usually learned. Found in batch normalization layers

and specific BNN techniques that will be discussed later.

Gain: A scaling factor that is usually learned (but sometimes extracted from statistics; see Section 3.5.2). Similar to a bias. A gain is applied after a dot product between weights and activations.

The term scaling factor is used at times in the literature, but we use gain here to emphasize its

correlation with bias.

Topology: The specific arrangement of layers in a network. The term “architecture” is used

frequently in the DNN community. However, the digital design and FPGA community also use the

term architecture to refer to the arrangement of hardware components. We use topology to refer to

the layout of the DNN model.

Architecture: The connection and layout of digital hardware. Not to be confused with the

topology of the DNN models themselves.

Fully Connected Layer: As a clarification, we use the term fully connected layer instead of

dense layer like some of the literature reviewed in this chapter.

3.3 Background

Various methods have been proposed to help make DNNs smaller and faster without sac-

rificing excess accuracy. Howard et al. proposed channel-wise separable convolutions as a way

to reduce the total number of weights in a convolutional layer [29]. Other low rank and weight

sharing methods have been explored [30, 31]. These methods do not reduce the data width of the

network, but instead, use fewer parameters for convolutional layers while maintaining the same

number of channels and kernel size.

SqueezeNet is an example of a network topology designed specifically to reduce the num-

ber of parameters used [32]. SqueezeNet requires fewer parameters by using more 1× 1 kernels

for convolutional layers in place of some 3×3 kernels. They also reduce the number of channels

in the convolutional layers to reduce the number of parameters even further.

Most DNN models are overparameterized and network pruning can help reduce size and

computation [33–35]. Neurons that do not contribute much to the network can be identified and


removed from the network. This leads to sparse matrices and potentially smaller networks with

fewer calculations.

3.3.1 Network Quantization Techniques

Rather than reducing the total number of parameters and activations to be processed in a

DNN, quantization reduces the bit width of the values used. Traditionally, 32-bit floating-point

values have been used in deep learning. Quantization techniques use data types that are smaller

than 32-bits and tend to focus on fixed point calculations rather than floating-point. Using smaller

data types can offer a reduction in total model size. In theory, arithmetic with smaller data types

can be quicker to compute and fixed-point operations can be more efficient than floating-point

ones. Gupta et al. show that reducing datatype precision in a DNN offers reduced model size with

limited reduction in accuracy [36].

We note, however, that 32-bit floating-point arithmetic operations have been highly op-

timized in GPUs and most CPUs. Performing fixed-point operations on hardware with highly

optimized floating-point units may not achieve the kinds of execution speed advantages that over-

simplified speedup calculations might suggest.

Courbariaux et al. compare accuracies of trained DNNs using various sizes of fixed and

floating-point values for weights and activations [37]. They even examine the effect of a hybrid

dynamic fixed-point data type and show how comparable accuracy can be obtained with sub-32-bit

precision.

Using quantized values for gradients has also been explored in an effort to reduce training

time. Zhou et al. experiment with several low bit widths for gradients [38]. They test various com-

binations of low bit-widths for activations, gradients, and weights. They observe that using higher

precision is more useful in gradients than in activations, and using higher precision in activations

is more useful than in weights.

3.3.2 Early Binarization

The most extreme form of network quantization is binarization. Binarization is a 1-bit

quantization where data can only have two possible values. Generally, −1 and +1 have been used


for these two values (or −γ and +γ when scaling is considered, see Section 3.6.1). Notice that

quantized networks that use the values −1 and 0 and +1 are not binary, but ternary, a confusion in

some of the literature [39–42]. They exhibit a high level of compression and simple arithmetic but

do not benefit from the single bit simplicity of BNNs since they require 2-bits of precision.

The idea of using binary weights predates the current boom in deep learning [43]. Early

networks with binary values contained only a single hidden layer [43,44]. These early works point

out that backpropagation (BP) and stochastic gradient descent (SGD) cannot be directly applied

to these networks since weights cannot be updated in small increments. As an alternative, early

works with binary values used variations of Bayesian inference. More recently [45] applies a

similar method, Expectation Backpropagation, to train deep networks with binary values.

Courbariaux et al. claim to be the first to train a DNN from start to finish using binary

weights and BP with their BinaryConnect method [46]. They use real-valued weights which are

binarized before being used by the network. During backpropagation, the gradient is applied to

the real-valued weights using the Straight-Through Estimator (STE) which is explained in Section

3.4.1.

While binary values are used for the weights, Courbariaux et al. retain full precision ac-

tivations in BinaryConnect. This eliminates the need for full precision multiplications but still

requires full precision accumulations. BinaryConnect is named in reference to DropConnect [47],

but connections are binarized instead of being dropped.

These early works in binary neural networks are certainly binary in a general sense. How-

ever, this chapter defines BNNs as networks that use binary values for both weights and activations

allowing for bitwise operations instead of multiply-accumulate operations. Soudry et al. were one

of the first research groups to focus on DNNs with binary weights and activations [45]. They use

Bayesian learning to get around the problems of learning with binary values [45]. However, Cour-

bariaux et al. are able to use binary weights and activations during training with backpropagation

techniques and take advantage of bitwise operations [28, 48]. Their BNN method is the basis for

most binary networks that have come since (with some notable exceptions in [49, 50]).


3.4 An Introduction to BNNs

Courbariaux et al. [28, 48] develop the BNN methodology that is used by most network

binarization techniques. In this section, we will review the functionality of this original BNN

methodology. Other specific details from [28, 48] will be reviewed in Section 3.5.1.

In BNNs, both the weights and activations are binarized. This reduces the memory require-

ment for BNNs and the computational complexity through the use of bitwise operations.

3.4.1 Binarization of Weights

Courbariaux et al. first provide a way to train using binary weights in [46] using back-

propagation with a gradient-descent based method (SGD, Adam, etc.). Using binary values during

training provides a more representative loss to train against instead of only binarizing a network

once training is complete. Computing the gradient of the loss w.r.t binary weights through back-

propagation is not a problem. However, updates to the weights using gradient descent methods

(SGD, Adam, etc.) prove impossible with binary weights. Gradient descent methods make small

changes to the value of the weights, which cannot be done with binary values.

In order to solve this problem, Courbariaux et al. keep a set of real-valued weights, WR,

which are binarized within the network to obtain binary weights, WB. WR can then be updated

through backpropagation and the incremental updates of gradient descent. During inference, WR is not

needed and the binary weights are the only weights that are stored and used. Binarization is done

using a simple sign function

WB = sign(WR) (3.1)

resulting in a tensor with values of +1 and −1.

Calculating the gradient of the loss w.r.t. the real-valued weights directly is meaningless

due to the sign function used in binarization. The gradient of the sign function is 0 or undefined

at every point. To get around this problem, Courbariaux et al. use a heuristic called the Straight

Through Estimator (STE) [51]. This method approximates a gradient by bypassing the gradient of

the layer in question. The problematic gradient is simply turned into an identity function

\[ \frac{\partial L}{\partial W_R} = \frac{\partial L}{\partial W_B} \tag{3.2} \]


where L is the loss at the output. This gradient approximation is used to update the real-valued weights.

This binarization is sometimes thought of as a layer unto itself. The weights are passed

through a binarization layer that evaluates the sign of the values in the forward pass and performs

an identity function during the backward pass, as illustrated in Figure 3.1.

Figure 3.1: A visualization of the sign layer and Straight-Through Estimator (STE). While the real values of the weights are processed by the sign function in the forward pass, the gradient of the binary weights is simply passed through to the real-valued weights.

Using the STE, the real-valued weights can be updated with an optimization strategy, like

SGD or Adam. Since the gradient updates can affect the real-valued weights WR without changing

the binary values WB, if values in WR are not bounded, they can accumulate to very large numbers.

For example, if during a large portion of training a positive value of WR is evaluated to have a

positive gradient, every update will increase that value. This can create large values in WR. BNNs

clip the values of WR between −1 and +1. This keeps the values of WR closer to WB.
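As a minimal PyTorch-style sketch (our own illustration of the methodology, not the authors' code), the binarization of the weights with a straight-through estimator and the clipping of the real-valued weights can be written as:

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w_real):
        return torch.sign(w_real)          # W_B = sign(W_R)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                 # straight-through: dL/dW_R := dL/dW_B

# After each optimizer step, the real-valued weights are clipped so they do not
# drift far from their binary counterparts:
# w_real.data.clamp_(-1.0, 1.0)
```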

3.4.2 Binarization of Activations

Binarization of the activation values was introduced in the first BNN paper by Courbariaux

et al. [28]. In order to binarize the activations, they are passed through a sign function using an

STE in the backward pass, similar to how the weights are binarized. This sign function serves as

the activation function in the network. In order to obtain good results, Courbariaux et al. find that

they need to cancel out the gradient in the backward pass if the input to the activation was too large,


using
\[ \frac{\partial L}{\partial a_R} = \frac{\partial L}{\partial a_B} \cdot \mathbf{1}_{|a_R| \leq 1} \tag{3.3} \]

where aR is the real-valued input to the activation function and aB is the binarized output of the

activation function. The indicator function 1_{|aR| ≤ 1} evaluates to 1 if |aR| ≤ 1 and 0 otherwise.

This zeros out the gradient if the input to the activation function is too large. This functionality

can be achieved by adding a hard tanh function before the sign activation function, but this layer

would only have an effect in the backward pass and no effect in the forward pass.
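A corresponding sketch of the binary activation (again illustrative, not the original code) saves the real-valued input so the backward pass can cancel gradients where |aR| > 1, matching Equation 3.3:

```python
import torch

class SignActivation(torch.autograd.Function):
    @staticmethod
    def forward(ctx, a_real):
        ctx.save_for_backward(a_real)
        return torch.sign(a_real)

    @staticmethod
    def backward(ctx, grad_output):
        (a_real,) = ctx.saved_tensors
        # Pass the gradient through only where the input magnitude is at most 1.
        return grad_output * (a_real.abs() <= 1).to(grad_output.dtype)
```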

3.4.3 Bitwise Operations

When using binary values, the dot product between weights and activations can be reduced

to bitwise operations. The binary values can either be −1 or +1. These signed binary values are

encoded with a 0 for −1 and a 1 for +1. To be clear, we refer to the signed values −1 and +1 as

binary “values” and their encodings, 0 and 1, as binary “encodings”.

Using an XNOR logical operation on the binary encodings is equivalent to performing

multiplication on the binary values, as seen in Table 3.1.

Table 3.1: This table shows how the XNOR operation on the encodings is equivalent to multiplication of the binary values, shown in parentheses.

Encoding (Value)   Encoding (Value)   XNOR (Multiply)
0 (−1)             0 (−1)             1 (+1)
0 (−1)             1 (+1)             0 (−1)
1 (+1)             0 (−1)             0 (−1)
1 (+1)             1 (+1)             1 (+1)

A dot product requires an accumulation of all the products between values. XNOR can per-

form the multiplication on a bitwise level, but performing the accumulation requires a summation

of the results of the XNOR operation. Using the binary encodings resulting from the XNOR oper-

ation, this can be done by counting the number of 1 bits in a group of XNOR products, multiplying

this value by 2, and subtracting the total number of bits, producing an integer value. Processor

instruction sets often include a popcount instruction to count the number of ones in a binary value.
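The following small Python function illustrates the arithmetic just described (the names and the use of Python integers as bit vectors are our own choices):

```python
def binary_dot(a_bits, w_bits, n):
    # a_bits and w_bits are n-bit integers whose bits encode +1 as 1 and -1 as 0.
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ w_bits) & mask     # bitwise "multiply" of the signed values
    ones = bin(xnor).count("1")          # popcount
    return 2 * ones - n                  # (#matching bits) - (#differing bits)

# Example: weights (+1, -1, +1, +1) against activations (+1, +1, -1, +1),
# encoded LSB-first, give binary_dot(0b1101, 0b1011, 4) == 0.
```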


These bitwise operations are much simpler to compute than multi-bit floating-point or

fixed-point multiplication and accumulation. This can lead to faster execution times and/or fewer

hardware resources required. However, theorizing efficiency speedups is not always straightfor-

ward.

For example, when looking at the execution time of a CPU, some papers that we reviewed

here use the number of instructions as a measure of execution time. The 64-bit x86 instruction set

allows a CPU to perform a bitwise XNOR operation between two 64-bit registers. This operation

takes a single instruction. With a similar 64 bit CPU architecture, two 32-bit floating-point multi-

plications could be performed. One could conclude that the bitwise operations would have a 32×

speed up over the floating-point operations. However, the number of instructions is not a measure

of execution time. Each instruction can take a variable amount of clock cycles to execute. Instruc-

tion and resource scheduling within a modern CPU core is dynamic and the number of cycles to

produce an instruction’s result depends on other instructions that have come before. CPUs and

GPUs are optimized for different types of instruction profiles. Instead of using the number of in-

structions as a measure of efficiency, it is better to look at the actual execution times. Courbariaux

et al. [28] observe a 23× speedup when optimizing their code for bitwise operations.

Not only do bitwise operations offer faster execution times in software-based implementa-

tions, but BNNs also require fewer hardware resources in digital designs.

3.4.4 Batch Normalization

Batch normalization (BN) layers are common practice in deep learning. They condition the

values within a network for faster training and act as a form of regularization. In BNNs, they are

considered essential. BN layers not only condition the values used during training, but they contain

gain and bias terms that are learned by the network. These learned terms help add complexity to

BNNs, which can suffer without them. The efficiency of BN layers is discussed in Section 3.6.7.
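Putting the pieces together, a typical BNN block (one common arrangement, sketched here for illustration; actual topologies vary) is a binary convolution followed by batch normalization, whose learned gain and bias condition the values before the next sign activation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, 3, 3))
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        # Inference-time view: binarized weights, then BN, then the sign activation.
        # Training would route both binarizations through the STE sketches above.
        w_bin = torch.sign(self.weight)
        y = F.conv2d(x, w_bin, padding=1)
        return torch.sign(self.bn(y))
```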

3.4.5 Accuracy

While BNNs are compact and efficient compared to their full precision counterparts, they

suffer degradation in accuracy. The original BNN proposed in [28] suffers a 3% loss in accuracy


on the CIFAR-10 dataset and did not show comparable results on the larger ImageNet dataset.

However, with improvement from other authors and modifications to the BNN methodology, more

recent BNNs have achieved comparable results on the ImageNet data set, showing a decrease in

accuracy of 3.1% on top-5 accuracy and 6.0% on the top-1 accuracy [52].

3.4.6 Robustness to Attacks

Full precision DNNs have been shown to be susceptible to adversarial attacks [53, 54].

Small perturbations to an input can cause gross misclassification in a classifier network. This is espe-

cially true of convolutional networks where input images can be altered in ways that are impercep-

tible to humans but cause extreme failure in the network.

BNNs, however, have shown robustness to this problem [55,56]. Small changes in the input

image have less of an effect on the network activations since discrete values are used. Adversarial

perturbations are generally computed using gradient methods, which, as discussed above, are not

directly computable in BNNs.

3.5 Major BNN Developments

While there has been much work done on BNNs and how to improve their accuracy, a

handful of works have put forth key ideas that significantly expound upon the original methodology

of BNNs [28]. Before discussing and comparing the literature of BNNs as a whole, we wish to step

through each of these selected works and look at the contributions of each work. These works are

either highly referenced by BNN literature, directly compared to in much of the BNN literature,

and/or made significant changes to the BNN methodology. We feel it is useful to examine them

individually. We recognize that this selection of works is somewhat subjective and works not

included in this section have made contributions as well. After reviewing each of these works in

isolation, we will examine the BNN literature as a whole, summarizing our observations by topic

rather than by publication.


3.5.1 The Original BNN

We already summarized the basics of the originally proposed methodology for BNNs in

Section 3.4. Here we will review details specific to [28,48] that were not mentioned earlier. Cour-

bariaux et al. reported their method and results which include details about their experiments on

the MNIST, SVHN, and CIFAR-10 datasets in [28]. In their other paper [48] they did not

include all of the details of these three experiments but did include their preliminary results on the

ImageNet dataset.

While most of the binarization done with BNNs is deterministic using the simple sign func-

tion, Courbariaux et al. discuss stochastic binarization, which can lead to better results than their

BNN model [28] and their earlier BinaryConnect Model [46]. Stochastic binarization binarizes

values using a probability distribution that favors binarizing to −1 when the value is closer to −1

and binarizing to +1 when the value is closer to +1. This helps regularize the training of the

network and produces better test results. However, generating probabilities for every

binarization requires more complex computation compared to deterministic binarization. Deter-

ministic binarization is used in almost all of the literature on BNNs.

Aside from the methodology presented in Section 3.4, Courbariaux et al. give details for

optimizing the learning process of BNNs. They point out that training a BNN takes longer than

traditional DNNs due to the STE heuristic needed to approximate the gradient of the real-valued

weights. To speed this process, they make modifications to the BN layer and the optimizer. For

both of these, they use a shift-based method, shifting all of the bits to the left or right, which is

an efficient way of multiplying or dividing a value by two. While this can speed up training time,

the majority of publications on BNNs focus on test accuracy and speed at inference rather than on optimizing training time.

The specific topologies used for the MNIST, SVHN, and CIFAR-10 datasets are reused by

many of the papers that follow [28,48]. Instead of processing the MNIST dataset with convolutions

they used 3 fully connected (FC) layers with 4096 nodes in the hidden layers. For the SVHN and

CIFAR-10 datasets, they use a VGG-like topology with two 128-channel convolutional layers, two

256-channel convolutional layers, two 512-channel convolutional layers, and three fully-connected

layers with 1024 channels for the first two. This topology has been replicated by many works based

on BNNs. BNN topology is discussed in detail in Section 3.7.2.


While [28] does not include results on experiments using ImageNet, [48] does provide

some details on the earliest BNN results for ImageNet. AlexNet and GoogleNet were both used, replacing their FC and convolutional layers with binary versions. While these models do not

perform very well during testing, it is a starting place that other works have built off of.

Courbariaux et al. point out that the bitwise operations of BNNs are not efficiently run

on standard deep learning frameworks without additional coaxing. They build and provide a

custom GPU kernel that runs efficient bitwise operations. They demonstrate the benefits of their

technique showing a 7× speed up on the MNIST dataset.

A follow-on paper [57] provides responses to the next works reviewed below, applications

for LSTMs, and a generalization to other levels of quantization.

3.5.2 XNOR-Net

Soon after the original work on BNNs [48], Rastegari et al. proposed a similar model called

XNOR-Net [58]. XNOR-Net was proposed to perform well on the ImageNet dataset. XNOR-Net

includes all of the major methods from the original BNN but adds a gain term to compensate for

lost information during binarization. This gain term is extracted from the statistics of the weights

and activations before binarization.

A pair of gain terms is computed for every dot product in the convolutional layer. The

L1-norms of the weights and activations are multiplied together to obtain this gain term. This gain term does improve the performance of BNNs, as shown by their results; however, their results

may be a bit misleading. Their own results were not reproducible in [38] and do not match the

results reported by Courbariaux et al. [48] or others [38, 59].

The gain term introduced by XNOR-Net seems to improve its performance, but it does

come at a cost. Rastegari et al. report a theoretical speedup of 64× over traditional DNNs. This

comes from a simple calculation that 1-bit operations should be 64× faster than 64-bit floating-

point operations. While this is not accurate, as described in Section 3.4.3, they do not take into

consideration the computations required to calculate the gain term. XNOR-Net must calculate L1-

norms for all convolutions during training and inference. The rest of the works presented in this

section make mention of this. While a gain term helps improve the accuracy of BNNs, the manner

in which XNOR-Net computes gain terms is costly.
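For the weight side, the gain amounts to one scalar per kernel derived from the L1-norm; a rough numpy sketch (our own illustration, not Rastegari et al.'s code) looks like the following. The activation-side gain is computed analogously from the incoming activations, which is exactly the part that must be recomputed for every input at inference.

```python
import numpy as np

def binarize_kernels_with_gain(W):
    # W: real-valued kernels of shape (out_channels, in_channels, k, k).
    # Each kernel is approximated by alpha * sign(W), where alpha is the mean
    # absolute value (L1-norm divided by the element count) of that kernel.
    alpha = np.abs(W).mean(axis=(1, 2, 3), keepdims=True)
    return np.sign(W), alpha
```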


Rastegari et al. point out that by placing the pooling layer after the dot product layer (FC

or convolutional layer) rather than after the activation layer, training is improved. This allows the

max pool layer to look at the signed integer values out of the dot product instead of the binary

values out of the activation. A max pool of binary values would have no information about the

magnitude of the inputs to the activation, thus the gradient is passed to all activations with a +1

value rather than the largest one before binarization.

3.5.3 DoReFa-Net

Zhou et al. try to generalize quantization and take advantage of bitwise operations for fixed-

point data with widths of various sizes [38]. They introduce DoReFa-Net, a model with a variable

bit width for weights, activations, and even gradient calculations during backpropagation. Zhou

et al. emphasized speeding up training time instead of just inference.

DoReFa-Net claims to utilize bitwise operations by breaking down dot products of fixed-

point values into multiple dot products of binary values. However, the complexity of their bitwise

operations is O(n²), where n is the width of the data used, which is not better than fixed-point dot

products.
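The quadratic cost can be seen directly in a sketch of the bit-plane expansion (our own illustration of the idea; DoReFa-Net's actual kernels realize each plane-pair product with bitwise operations and popcounts):

```python
import numpy as np

def dot_via_bitplanes(a, w, bits):
    # a, w: 1-D arrays of unsigned integers representable in `bits` bits.
    # The dot product expands into bits*bits binary dot products, one per pair
    # of bit planes, hence the O(n^2) growth in the data width.
    total = 0
    for i in range(bits):
        a_plane = (a >> i) & 1
        for j in range(bits):
            w_plane = (w >> j) & 1
            total += (1 << (i + j)) * int(np.dot(a_plane, w_plane))
    return total
```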

Zhou et al. point out the inefficiencies of XNOR-Net’s method for calculating a gain term.

DoReFa-Net does not dynamically calculate gain terms using the L1-norm of both activations and

weights. Instead, gain terms are only based on the weights of the network. This allows for efficient

inference since the weights and gain terms do not change after training.

The topology of DoReFa-Net is used throughout the BNN literature which is explained in

Section 3.8.2.

3.5.4 Tang et al.

Tang et al. [60] present multiple ideas for BNNs that are used by others. They do not

present a new topology but binarize AlexNet and focus on classification accuracy on the ImageNet

dataset.

To speed up training, Tang et al. study how the learning rate affects the rate of improvement

in the network and how it affects the rate at which binary values oscillate between −1 and +1. For


a given learning rate, the sign of the weights in a BNN oscillates much more frequently than in a

full-precision network. The number of sign changes in a BNN is orders of magnitude more than in

a traditional network. Tang et al. show better training in BNN when small learning rates are used.

Tang et al. take advantage of a gain term in their network and point out the inefficient

manner in which XNOR-Net uses a gain term. They propose to use a learned scaling factor by

using a PReLU layer in their network. As opposed to the ReLU layer which zeros out negative

inputs, PReLU learns a positive gain term to apply to the negative input values. This gain is applied

indirectly within the PReLU.

Tang et al. notice the bulky nature of the FC layers used in previous BNN implementations.

FC layers perform much larger dot products than convolutional layers. In a convolutional layer,

many small dot products are calculated, while an FC layer performs a single dot product that is much larger than those used in convolutional layers. In BNNs, whole values (−1 and +1) are used

instead of the fractional values seen in traditional DNNs. Tang et al. point out that this can lead to

a large range of possible values for the final FC layer, which does not play nicely with the softmax

function used in classification.

Previous works get around this by leaving the final layer at full precision instead of binariz-

ing it. Tang et al. binarize the last layer and introduce a learned gain term at every neuron. With a

binarized last layer, the BNN becomes much more compact since most of the model’s parameters

lie in the FC layers.

To help generalization, Tang et al. emphasize the importance of a regularizer. This is the

first instance of a regularizer used during the training of a BNN that we could find. They also use

multiple bases for binarization which is discussed in Section 3.6.2.

3.5.5 ABC-Net

The ABC-Net model is introduced in [52] by Lin et al. ABC-Net tries to close the

accuracy gap between BNNs and full precision networks. ABC-Net generalizes some of the multi-

bit ideas in DoReFa-Net and the gain terms learned by the network in [60].

ABC-Net binarizes activations and weights into multiple bases. For weights, the binariza-

tion function is given by

\[ B_{w_i} = \text{sign}\left(\overline{W} + u_i\,\text{std}(W)\right) \tag{3.4} \]


where W is the set of weights being binarized, W̄ = W − mean(W), std(W) is the standard deviation of W, and ui is a learned term. A set of Bi binarizations is produced according to the learned parameters ui. This binarization function centers the weights W around zero and produces different binarizations according to different threshold biases (ui std(W)).
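A small numpy sketch of Equation 3.4 (illustrative only; variable names are ours) makes the role of the learned shifts ui concrete:

```python
import numpy as np

def abc_weight_bases(W, u):
    # W: real-valued weights; u: list of learned shift parameters, one per base.
    W_centered = W - W.mean()
    s = W.std()
    return [np.sign(W_centered + u_i * s) for u_i in u]
```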

These binarized linear bases can be used in bitwise dot products with the activations. The

results are then combined in a linear combination with learned gain terms. This technique is

reminiscent of the multi-bit method proposed in DoReFa-Net, but instead of using the slices from the powers of two, the bases are derived from learned biases that act as thresholds. This requires more

calculations, but offers better accuracy than DoReFa-Net for the number of bitwise operations

used. It also uses learned gain terms in the linear combination of the bases which is a more general

use of a gain term than just in a PReLU layer like Tang et al. [60].

The binarization of the weights is aided by calculating the mean and standard deviation of

the weights. Once the weights are learned, there is no need to calculate the mean and standard

deviation again during inference. If the same method were used on the activations, the network

would suffer from a similar problem as XNOR-Net where these values would need to be calculated

again during inference. Instead, ABC-Net makes multiple binarized bases for the activations using

a learned threshold bias without the aid of the mean or standard deviation with

\[ B_{A_i} = \text{sign}(A + u_i) \tag{3.5} \]

where BAi is the binarized base for the set of activations A and ui is a learned threshold bias. A gain

term is learned which is associated with each activation base.

Each binary activation base can be combined with each binary weight base in a dot product.

ABC-Net takes advantage of efficient gain terms and multiple biases, but the computation cost is

higher for each base that is added.

The ABC-Net method is applied to various sizes of ResNet topologies and shows only a

3.3% drop in top-5 accuracy on ImageNet compared to full 32-bit precision when using 5 bases

for both activations and weights.


3.5.6 BNN+

Darabi et al. extend some of the core principles of the original BNN by looking at al-

ternatives to the STE used by all previous BNNs. The STE simply uses an identity function for

the backpropagation of the gradient though the signed activation layer. Combining this with the

need to kill gradients of large activations (see Section 3.4.2), the backpropagation of gradients

through the sign activation function can be viewed as an impulse function which clips the gradient if

the activation has an absolute value greater than 1.

The BNN+ methodology improves learning through a different effective backpropagation

function in place of the impulse function. Instead of the impulse function, a variation of the

derivative of the Swish-like activation is used:
\[ \frac{d\,SS_\beta(x)}{dx} = \frac{\beta\left(2 - \beta x \tanh\!\left(\frac{\beta x}{2}\right)\right)}{1 + \cosh(\beta x)} \tag{3.6} \]

where β can be a hyperparameter set by the user or a learned parameter set by the network. Darabi

et al. state that this type of function allows for better training since it is differentiable, rather than piecewise, at −1 and +1.
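Evaluated directly, Equation 3.6 is just the following function of the pre-activation, which BNN+ substitutes for the rectangular STE gradient in the backward pass (a plain numpy sketch of the formula):

```python
import numpy as np

def ss_derivative(x, beta):
    # Derivative of the Swish-like SS_beta function used in place of the STE.
    return beta * (2.0 - beta * x * np.tanh(beta * x / 2.0)) / (1.0 + np.cosh(beta * x))
```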

Another focus of the BNN+ methodology is a regularization function that helps force the

weights towards −1 and +1. They compare two approaches, one that is based on the L-1 norm

\[ R_1(w) = \big|\,\alpha - |w|\,\big| \tag{3.7} \]

and another that is based on the L-2 norm.

\[ R_2(w) = \left(\alpha - |w|\right)^2 \tag{3.8} \]

When α = 1 this regularizer is minimized when weights are closer to −1 and +1. The network is

allowed to learn this parameter.
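Equations 3.7 and 3.8 translate directly into code; both penalties vanish as the magnitudes of the weights approach α (a straightforward numpy sketch):

```python
import numpy as np

def r1(w, alpha=1.0):
    # L1-style regularizer of Equation 3.7, summed over all weights.
    return np.sum(np.abs(alpha - np.abs(w)))

def r2(w, alpha=1.0):
    # L2-style regularizer of Equation 3.8, summed over all weights.
    return np.sum((alpha - np.abs(w)) ** 2)
```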

In addition to these new techniques, BNN+ uses a gain term. It is notable that multiple

bases are not used. Compared to other single base techniques, BNN+ reports the best accuracies

on ImageNet, but does not perform quite as well as ABC-Net.


3.5.7 Comparison

Here we compare the methods reviewed in this section. Table 3.2 summarizes the accura-

cies of these methods on the CIFAR-10 dataset. Table 3.3 compares accuracies on the ImageNet

dataset. See Section 3.7 for further accuracy comparisons of BNNs. Table 3.4 compares the fea-

tures of each method. The results are listed in each table in chronological order of when they were

published.

It is interesting to note that the results reported by Courbariaux et al. [28] on the CIFAR-10

dataset for the original BNN method achieve the best performance. Most of the work since [28]

has focused on improving results on the ImageNet dataset.

Table 3.2: Comparison of accuracies on the CIFAR-10 dataset from works presented in this section.

Methodology    Topology      Accuracy (%)
Original BNN   BNN           89.85
XNOR-Net       BNN           89.83
BNN+           AlexNet       87.16
BNN+           DoReFa-Net    83.92

3.6 Improving BNNs

Several techniques for improving the accuracy of BNNs were reviewed throughout the last

section. We will now cover each technique individually.

3.6.1 Scaling with a Gain Term

Binarization only allows information about the sign of inputs to be passed to the next layers

in the network while information about the magnitude of the input is lost. The resulting values are

either −1 or +1. This allows for efficient computation using bitwise dot product operations at a

cost of lost information in the network. Gain terms can be used after the bitwise dot products have

occurred to give the output a sense of magnitude. Many works on BNNs point out that this allows for a binarization with values of −γ and +γ, where γ is the gain term. This lends the network more


Table 3.3: Comparison of accuracies on the ImageNet dataset from works presented in this section. Full precision network accuracies are included for comparison as well.

Methodology      Topology    Top-1 Accuracy (%)   Top-5 Accuracy (%)
Original BNN     AlexNet     41.8                 67.1
Original BNN     GoogleNet   47.1                 69.1
XNOR-Net         AlexNet     44.2                 69.2
XNOR-Net         ResNet18    51.2                 73.2
DoReFa-Net       AlexNet     43.6                 -
Tang et al.                  51.4                 75.6
ABC-Net          ResNet18    65.0                 85.9
ABC-Net          ResNet34    68.4                 88.2
ABC-Net          ResNet50    76.1                 92.8
BNN+             AlexNet     46.11                75.70
BNN+             ResNet18    52.64                72.98
Full Precision   AlexNet     57.1                 80.2
Full Precision   GoogleNet   71.3                 90.0
Full Precision   ResNet18    69.3                 89.2
Full Precision   ResNet34    73.3                 91.3
Full Precision   ResNet50    76.1                 92.8

Table 3.4: A table of major details of the methods presented in this section. Activation refers to which kind of activation function is used. Gain describes how gain terms were added to the network. Multiplicity refers to how many binary convolutions were performed in parallel in place of full precision convolution layers. The regularization column indicates which kind of regularization was used, if any.

Methodology    Activation            Gain             Multiplicity   Regularization
Original BNN   Sign Function         None             1              None
XNOR-Net       Sign Function         Statistical      1              None
DoReFa-Net     Sign Function         Learned Param.   1              None
Tang et al.    PReLU                 Inside PReLU     2              L2
ABC-Net        Sign w/ Thresh.       Learned Param.   5              None
BNN+           Sign w/ SSt for STE   Learned Param.   1              L1 and L2


complexity if used correctly. If −γ and +γ are fed directly into another sign activation function

centered at 0, the gain term would have no effect since sign(±γ) = sign(±1). BNNs with

BN layers can avoid this pitfall since a bias term is built in. See Section 3.6.7 for more details on

the combination of BN and the sign activation function.

Gain terms can be used to give more capacity to a network when multiple gain terms are

used within a dot product or to form a linear combination of parallel dot products. Instead of simply

changing the values used in the binarization from +1 and −1 to +γ and −γ , different weights can

be added to elements within the binary dot product to make it act more like a full precision dot

product. This is what is done with XNOR-Net [58]. XNOR-Net uses magnitude information to

form a gain term for both the weights and activations. Every “pixel” in a tensor of feature maps

has its own gain term based on the magnitude of all the channels at that “pixel”. Every “kernel”

also has its own gain term. However, as detailed in Section 3.5.2, this is not very efficient. A full-precision convolution is required since the gain of every “pixel” acts as a full-precision weight.

Instead of using gain terms within the dot products like XNOR-Net, other works use gains

to perform a linear combination between parallel dot products. Some networks use gain terms that

are extracted from the statistics of the inputs [38, 60], and others learn these gain terms [52, 61].

The idea of parallel binary dot products that are combined in a linear combination is often referred

to as multiple bases, which is covered in the next section.

3.6.2 Using Multiple Bases

Multiple binarizations of a single set of inputs have been used to help with the loss of

information during binarization in BNNs. These multiple binarizations can be seen as bases that

can be combined to form a result with more information. Efficient dot products can still be used in

computing outputs, but extra computations are needed to combine the multiple bases together.

DoReFa-Net [38] breaks inputs into bases where each base corresponds to a power of two.

There is one binary base for each power of two needed to represent the data being binarized. The

number of bases needs to match the number of fixed-point bits of precision in the input. DoReFa-

Net uses fewer bits of precision in the data when less precision is desired. This leads to no loss of information compared to the input and gives the appearance of more efficient computations.


However, this technique does not save any computations over regular fixed-point multiplication

and addition.

Another technique is to compute the residual error between a scaled binarization and its

input, then compute another scaled binarization based on that error. This type of binarization is

known as residual binarization. Both Tang et al. [60] and ReBNet [62] use residual binarizations (which

should not be confused with residual layers in DNNs). Compared to DoReFa-Net, the gain term

for each base is dependent on the magnitude of the input value or residual. Information is lost, but

the first bases computed hold more information. This is a more efficient use of the bases than the

straightforward fixed-point base-two method of DoReFa-Net. Floating-point gain values can be used and are preferable in such a scheme, which makes it better suited to GPUs and CPUs that are optimized

for floating-point computations. However, more computations are needed in order to calculate the

residuals of the binarization, a similar problem to XNOR-Net, but on a smaller scale since this is

being done for a handful of bases instead of every “pixel” in a feature map.

Using information from activations in order to compute multiple bases requires more com-

putations, as seen in [60, 62]. ABC-Net [52] simply learns various bias terms for thresholding and

different gain terms for scaling bases when computing activations. By allowing the network to

learn these values instead of computing them directly, there are no extra computations required at

inference time. ABC-Net still uses magnitude information from the weights during training, but

since weights are set after training, constant values are used during inference.
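As a concrete illustration of residual binarization, the following NumPy sketch is our own simplified rendering of the idea, not code from [60] or [62]: each base binarizes the residual left by the previous bases, and its gain is the mean magnitude of that residual.

    import numpy as np

    def residual_binarize(x, num_bases=3):
        # Approximate x as a sum of scaled binary bases. Each base
        # binarizes the residual error left over from the previous bases.
        residual = x.astype(np.float64)
        bases, approx = [], np.zeros_like(residual)
        for _ in range(num_bases):
            b = np.where(residual >= 0, 1.0, -1.0)   # binary base
            gamma = np.abs(residual).mean()          # gain for this base
            bases.append((gamma, b))
            approx += gamma * b
            residual = residual - gamma * b          # error passed to the next base
        return bases, approx

    x = np.random.randn(4, 4)
    _, approx = residual_binarize(x, num_bases=3)
    print("mean absolute error:", np.abs(x - approx).mean())

The first base carries the most information and each additional base refines the approximation, which matches the behavior described above.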

3.6.3 Partial Binarization

Instead of binarizing the whole network, a couple of methods have been proposed to binarize only the parts of the network that benefit the most from the compression and keep the most

essential layers at full precision. The original BNN method and most other BNNs do in fact use

partial binarization since the last layer is kept at a higher precision to achieve the results that they

do. Tang et al. [60] propose a method for overcoming this (see Section 3.5.4).

Other networks have embraced this idea, sacrificing efficiency and model compression for

better accuracy by increasing the number of full precision layers. TaiJiNet [60] divides the kernels of a convolutional layer into two groups: more important kernels that will not be binarized and

another group of kernels that will be binarized. To determine which kernels are more important,


TaiJiNet looks at combinations of statistics, including the L1- and L2-norms, mean, standard deviation, and variance.

Prabhu et al. [63] also explored partial binarization. Instead of separating out individual

kernels in a convolutional layer, each layer is analyzed as a whole. Every layer in the network is

given a score, then clustering is done to find an appropriate threshold that will split low scores from high scores, deciding which layers will be binarized and which will not.

Wang et al. [64] use partial binarization during the training for better accuracy. The network

is gradually binarized as the network is trained. The method goes against the original method of

Courbariaux et al. [28] where binarization during training was desired. However, Wang et al. argue

that introducing binarization gradually during training helps achieve better accuracy.

Partial binarization is well suited for software-based systems where control is not dictated

by a hardware layout. Full binarization may not take full advantage of the available resources

of a system while a full-precision network may prove to be too difficult. Partial binarization can

be customized to a system but requires extra effort in selecting what parts to binarize. Partial

binarization decisions would need to be application and system specific.
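As a rough sketch of how such a selection might be automated, the following NumPy example scores each kernel of a convolutional weight tensor by its L1-norm and binarizes only the lower-scoring ones. The scoring rule and the fixed keep ratio are our own simplified stand-ins, not the exact criteria used by TaiJiNet or Prabhu et al.

    import numpy as np

    def partially_binarize(weights, keep_ratio=0.25):
        # weights: (num_kernels, k, k, channels). Kernels are scored by
        # L1-norm; the top keep_ratio fraction stays at full precision,
        # the rest is replaced by sign(w) scaled by its mean magnitude.
        scores = np.abs(weights).sum(axis=(1, 2, 3))
        cutoff = np.quantile(scores, 1.0 - keep_ratio)
        out = weights.copy()
        for i, score in enumerate(scores):
            if score < cutoff:
                gain = np.abs(weights[i]).mean()
                out[i] = np.where(weights[i] >= 0, gain, -gain)
        return out

    w = np.random.randn(8, 3, 3, 16)
    w_mixed = partially_binarize(w, keep_ratio=0.25)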

3.6.4 Learning Rate

It has been shown by experience that BNNs take longer to train than full precision networks.

While traditional DNNs can make small adjustments to their weights during optimization, BNNs

update real-valued weights that may or may not produce a change in the output of the loss function. These real-valued weights can be thought of as quasi-accumulators that hold a running total of the

gradient for each binary weight. It takes an accumulation of gradient steps to change the sign of

the real-valued weight and thus change the binary weight.

In addition, most of the weights in BNNs converge to either positive or negative values [61]. The binary weights do not change even though the optimizer repeatedly dictates a step in that

same direction. Many of the gradients that are calculated never have any effect on the loss and

never improve the network. For these reasons Tang et al. suggest that a higher learning rate should

be used to speed up training [60].


3.6.5 Padding

In DNNs, convolutions are often padded with zeros to help make the topology and data

flow more uniform. This is standard practice for convolutional layers. With BNNs however, using

zero padding adds a third value along with −1 and +1. Since there is no binary encoding for 0, the

bitwise operations are not compatible. We found that this is overlooked in much of the available

software source code provided by authors. Several works focusing on digital design and FPGAs

( [65–67]) point out this problem. Zhao et al. [65] experiment with the effects of just using +1

for padding. Fraser et al. [67] use −1 for padding and report that it works just as well as zero

padding. Guo et al. [66] explore this problem in detail and claim that simple padding with either

−1 or +1 hurts the accuracy of the BNN. Instead, they test alternating padding where the border

is padded with −1s and +1s at every other location. This method improves accuracy, but only

slightly. To help even further, they alternate which value they begin the padding from one channel

to the next. At every location where a −1 for padding in odd-numbered channels, a +1 is used in

even-numbered channels and vice versa. This helps the BNN network achieve accuracy similar to

a zero-padded network.
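A minimal NumPy sketch of this alternating padding scheme, written as our own interpretation of the description in [66]: border values alternate between −1 and +1 in a checkerboard pattern, and the starting value is flipped between odd- and even-numbered channels.

    import numpy as np

    def alternating_pad(x):
        # x: (channels, H, W) feature map with values in {-1, +1}.
        c, h, w = x.shape
        out = np.empty((c, h + 2, w + 2), dtype=x.dtype)
        rows = np.arange(h + 2)[:, None]
        cols = np.arange(w + 2)[None, :]
        for ch in range(c):
            # Checkerboard of -1/+1, phase-shifted by one position per channel.
            out[ch] = np.where((rows + cols + ch) % 2 == 0, 1, -1)
            out[ch, 1:-1, 1:-1] = x[ch]   # keep the original values inside the border
        return out

    x = np.where(np.random.rand(2, 4, 4) > 0.5, 1, -1)
    print(alternating_pad(x)[0])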

3.6.6 More Binarization

The original BNN methodology is not completely binarized. As mentioned in Section

3.6.5, the convolutional padding scheme used by the original open-source BNN software imple-

mentation uses zero-padding which introduces 0 values into the network. This turns the network

into a ternary network instead of a binary network. Some hardware implementations get around

this by not using padding at all. The methods mentioned in Section 3.6.5 allow for complete use of bitwise operations and padding, leading to faster run times than networks that involve 0 values.

Other parts of most BNN models that are not completely binary are the first and last layers.

The FBNA [66] methodology focuses on making the BNN topology completely binary. This

includes binarizing the first layer. Instead of using the fixed precision values from the input data,

they use a scheme similar to DoReFa-Net [38] where a smaller bit width is used for the values and

the small bit-width values are split into multiple binarizations without losing precision. Instead of


using a base for each power of two used to represent the data, they use as many bases as needed to

be able to sum binary values to get the original value. This seems to be a less efficient technique

since n² bases are needed for n bits of precision.

Tang et al. [60] introduce a method for binarizing the final FC layer of a network, which is

traditionally left at higher precision. They use a learnable channel-wise gain term within the dot

product to allow for manageable numbers instead of large whole values. Details are provided in

Section 3.5.4.

3.6.7 Batch Normalization and Activations as a Threshold

The costly calculation of the BN layer may seem to contradict the efficiency of the BNN

methodology. However, implementing a BN layer in the forward pass is arithmetically equivalent

to an integer threshold in BNNs. The BN layer computes

\mathrm{BN}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta \qquad (3.9)

where x is the input, µ is the mean value of the batch, σ² is the variance of the batch, ε is added for numerical stability, and γ and β are learned gain and bias terms. Since the activations simply calculate sign(BN(x)),

\mathrm{sign}(\mathrm{BN}(x)) = \begin{cases} +1 & x \geq \tau \\ -1 & x < \tau \end{cases} \qquad (3.10)

where

\tau = \frac{-\beta\sqrt{\sigma^2 + \epsilon}}{\gamma} + \mu \qquad (3.11)

This is only true if γ is positive. For negatively valued γ , the same comparator would be

used, but x would be negated. Since integer values are produced by the dot product in a BNN, τ

can be rounded appropriately to an integer.

This method is very useful in digital designs where thresholding is much simpler than the

explicit arithmetic required for BNN layers during training.
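The equivalence is easy to check numerically. The sketch below is our own NumPy illustration, assuming a positive γ and illustrative parameter values: it computes τ as in Equation 3.11 and confirms that a simple threshold on the integer dot-product output reproduces sign(BN(x)).

    import numpy as np

    # Illustrative trained batch-norm parameters for one channel.
    mu, var, eps = 3.0, 4.0, 1e-5
    gamma, beta = 1.5, -0.75          # assumes gamma > 0

    def sign_bn(x):
        # sign(BN(x)) computed explicitly, as it would be during training.
        bn = (x - mu) / np.sqrt(var + eps) * gamma + beta
        return np.where(bn >= 0, 1, -1)

    # Equivalent threshold used at inference (Equation 3.11).
    tau = -beta * np.sqrt(var + eps) / gamma + mu

    x = np.arange(-10, 11)            # integer outputs of a binary dot product
    assert np.array_equal(sign_bn(x), np.where(x >= tau, 1, -1))
    print("threshold tau =", tau)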


3.6.8 Layer Order

As pointed out in Section 3.5.2, BNNs can be better trained if the pooling layer is placed

before the BN/activation layer. However, in the forward pass, there is no difference in the order of

these particular layers. Umuroglu et al. [68] point out that execution is faster during inference if

the pooling layer comes after the activation. That way binary values can be used, eliminating the

need for comparisons in the max-pooling layer. If any of the inputs is +1, the output is +1. An

OR function can be used on the binary encodings within the network to achieve max pooling.
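A small NumPy sketch of this observation (our own illustration): with +1 encoded as a set bit and −1 as a cleared bit, 2×2 max pooling reduces to an OR over each window, which matches an explicit max over the same windows.

    import numpy as np

    def binary_maxpool_2x2(x):
        # x: 2-D feature map with values in {-1, +1} and even dimensions.
        bits = (x == 1)                                  # +1 -> True, -1 -> False
        pooled = (bits[0::2, 0::2] | bits[0::2, 1::2] |
                  bits[1::2, 0::2] | bits[1::2, 1::2])   # OR over each 2x2 window
        return np.where(pooled, 1, -1)

    x = np.where(np.random.rand(4, 4) > 0.5, 1, -1)
    assert np.array_equal(binary_maxpool_2x2(x),
                          x.reshape(2, 2, 2, 2).max(axis=(1, 3)))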

3.7 Comparison of Accuracy

In this section, we present a comparison of the accuracy of BNNs on a few different

datasets. Each accuracy is associated with a specific publication and includes only the authors' self-reported results, not accuracies that other authors reported as a comparison.

3.7.1 Datasets

Four major datasets are used to test BNNs, and we include results for all four. Various other publications exist for specialized applications of BNNs on specific datasets [18, 45, 69–76].

MNIST: A simple dataset of handwritten digits. The images are only 28× 28 pixel

grayscale images with 10 classes to classify. This dataset is fairly easy, and FC layers are used to classify these images instead of CNNs. Courbariaux et al. do this in their original work on BNNs [28], claiming that classification is harder with FC layers and is therefore a better test of a BNN's capabilities. The

dataset contains 60,000 images for training and 10,000 for testing.

SVHN: The Street View House Numbers dataset. A dataset of photos of single digits (0–9)

taken from photos of house numbers. These color photos are centered on a single digit. The dataset

contains just under 100,000 32×32 images classified into 10 classes.

CIFAR-10: A dataset of 60,000 32×32 photos. Contains 10 different classes: 6 animals and 4 vehicles.

ImageNet: Larger photographs of varying sizes. These images are usually resized to a

common size before processing. Contains images from 1000 different classes. ImageNet is a

much larger and more difficult dataset than the other three datasets mentioned. The most common


version of this dataset, from 2012, contains 1.2 million images for training and 150,000 images for

validation.

3.7.2 Topologies

While BNN methods can be applied to any topology, many BNNs compared in the literature

are binarizations of common topologies. We list the topology of networks used as we compare

methods. Papers that did not specify which topology was used are denoted with NK in the topology

column throughout this chapter. For topologies that were ambiguous, we provide as much detail

as was provided by the authors.

All the topologies compared in this section are feedforward topologies. They are either

described by their name if they are well established in the literature (like AlexNet or ResNet) or

we describe them layer by layer with our own notation.

Our notation is similar to others used in the literature. Layers are specified in order. Three

types of layers can be specified: Convolutional layers, C; fully connected layers, FC; and max-

pooling layers, MP. BN layers and activations are implied and are not listed. The number of

output channels of a layer is listed directly after the type of layer. For example, FC1024 is a fully

connected layer with 1024 output channels. The number of input channels can be determined by

the output of the last layer or the size of the input image or the number of activations produced by

the last max-pooling layer. All max-pooling layers have a receptive field of 2×2 pixels and a stride

of 2. Duplicates of layers also occur often. We list the multiplicity of layers before the type. Two

convolutional layers with 256 output channels could be listed as C256-C256, but we use 2C256

instead.

To better understand and compare the accuracies in this section, we provide a description

of the common topologies used by BNNs that are not otherwise well known. We refer to these

common topologies in the comparison tables in this section.

We will refer to the topology of the convolutional BNN proposed by Courbariaux et al. [28]

and used on the SVHN and CIFAR-10 datasets as BNN. It is a variation of a VGG-11 topology

with the following structure: 2C128-MP-2C256-MP-2C512-MP-2FC1024-FC10 as seen in Figure

3.2. Other networks use this same topology but reduce the number of channels by half. We denote

these as 1/2 BNN.


Figure 3.2: Topology of the original Binarized Neural Networks (BNN). Numbers listed denote the number of output channels for the layer. Input channels are determined by the number of channels in the input, usually 3, and the input size for the FC layers.


We will refer to a common three-layer MLP used as MLP with the following structure:

3FC1024-FC10. 4xMLP will denote an MLP with 4 times as many hidden channels (3FC4096-

FC10).

Some works mention the DoReFa-Net topology. The original DoReFa-Net paper does not

outline any specific topology but instead outlines a general methodology [38]. We suspect that pa-

pers that claim to use the DoReFa-Net topology use a software implementation of DoReFa-Net like

the one included in Tensorpack for Tensorflow, which may be a binarization of a popular topology

like AlexNet. However, since we do not know for sure, we denote these entries as DoReFa-Net.

3.7.3 Table of Comparisons

Seven tables are included in this section to report the accuracies of different BNN methodologies for the MNIST (Tables 3.5 and 3.6), SVHN (Tables 3.7 and 3.8), CIFAR-10 (Tables 3.9 and 3.10), and ImageNet (Table 3.11) datasets. We also report the accuracies of related non-binary networks, like partially binarized networks and BinaryConnect, which preceded BNNs.

Accuracies on MNIST

Table 3.5: BNN accuracies on the MNIST dataset. The accuracy reported for [77] was not explicitly stated by the authors. This number was inferred from the figure provided.

Source   Accuracy (%)   Topology
[78]     95.7           FC200-3FC100-FC10
[28]     96.0           MLP
[77]     97             NK
[79]     97.0           LeNet
[80]     97.69          MLP
[81]     97.86          ConvPool-2
[62]     98.25          1/4 MLP
[68]     98.4           MLP
[82]     98.40          MLP
[83]     98.6           NK
[84]     98.67          MLP
[85]     98.77          FC784-3FC512-FC10


Table 3.6: Accuracies on the MNIST dataset of non-binary networks related to works reviewed.

Source   Accuracy (%)   Topology                        Precision
[41]     95.15          NK                              Ternary values
[86]     96.9           NK                              8-bit values
[86]     97.2           NK                              12-bit values
[84]     98.53          MLP                             2-bit values
[46]     98.71          BinaryConnect - deterministic   32-bit float activations
[80]     98.74          MLP                             32-bit float
[46]     98.82          BinaryConnect - stochastic      32-bit float activations
[39]     99.1           NK                              Ternary values

Accuracies on SVHN

Table 3.7: BNN accuracies on the SVHN dataset.

Source   Accuracy (%)   Topology
[80]     94.9           1/2 BNN
[68]     94.9           1/2 BNN
[66]     96.9           NK
[62]     97.00          C64-MP-2C128-MP-2C256-2FC512-FC10
[38]     97.1           DoReFa-Net
[28]     97.47          1/2 BNN

Table 3.8: Accuracies on the SVHN dataset of non-binary networks related to works reviewed.

Source   Accuracy (%)   Topology                        Precision
[42]     97.60          1/2 BNN                         Ternary weights
[42]     97.70          BNN                             Ternary weights
[46]     97.70          BinaryConnect - deterministic   32-bit float activations
[46]     97.85          BinaryConnect - stochastic      32-bit float activations


Accuracies on CIFAR-10

Table 3.9: BNN accuracies on the CIFAR-10 dataset.

Source   Accuracy (%)   Topology                            Disambiguation
[87]     66.63          2 conv. and 2 FC
[67]     79.1           1/4 BNN
[68]     80.1           1/2 BNN
[80]     80.1           1/2 BNN
[40]     80.4           VGG16
[88]     81.8           VGG11
[83]     83.27          NK
[61]     83.52          DoReFa-Net                          R2 regularizer
[61]     83.92          DoReFa-Net                          R1 regularizer
[81]     84.3           NK
[67]     85.2           1/2 BNN
[89]     85.9           6 conv.
[79]     86.0           ResNet-18
[90]     86.05          9 256-ch conv.
[87]     86.06          5 conv. and 2 FC
[91]     86.78          NK
[62]     86.98          C64-MP-2C128-MP-2C256-2FC512-FC10
[77]     87             NK
[61]     87.16          AlexNet                             R1 regularizer
[61]     87.30          AlexNet                             R2 regularizer
[65]     87.73          BNN                                 +1 padding
[59]     88             BNN                                 512 channels for FC
[67]     88.3           BNN
[65]     88.42          BNN                                 0 padding
[85]     88.47          6 conv.
[66]     88.61          NK
[28]     89.85          BNN

Table 3.10: Accuracies on the CIFAR-10 dataset of non-binary networks related to works reviewed.

Source   Accuracy (%)   Topology                        Precision
[40]     81.0           VGG16                           Ternary values
[40]     82.9           VGG16                           Ternary values
[42]     86.71          1/2 BNN                         Ternary values
[42]     89.39          BNN                             Ternary values
[46]     90.10          BinaryConnect - deterministic   32-bit float activations
[46]     91.73          BinaryConnect - stochastic      32-bit float activations


Accuracies on ImageNet

Table 3.11: BNN accuracies on the ImageNet dataset.

Source   Top 1 Acc. (%)   Top 5 Acc. (%)   Topology          Details
[48]     36.1             60.1             BNN AlexNet
[38]     40.1             -                AlexNet
[62]     41.43            -                Details in [62]
[57]     41.8             67.1             BNN AlexNet
[58]     44.2             69.2             AlexNet
[38]     43.6             -                AlexNet           Pre-trained on full precision
[61]     45.62            70.13            AlexNet           R2 reg
[61]     46.11            75.70            AlexNet           R1 reg
[48]     47.1             69.1             BNN GoogleNet
[63]     48.2             71.9             AlexNet           Partial binarization
[58]     51.2             73.2             ResNet18
[61]     52.64            72.98            ResNet-18         R1 reg
[61]     53.01            72.55            ResNet-18         R2 reg
[92]     54.8             77.7             ResNet-18         Partial binarization
[93]     55.8             78.7             AlexNet           Partial binarization
[52]     65.0             85.9             ResNet-18         5 bases
[52]     68.4             88.2             ResNet-34         5 bases
[52]     70.1             89.7             ResNet-50         5 bases
[94]     -                75               VGG 16
[60]     51.4             75.6             AlexNet           Binarized last layer


3.8 Hardware Implementations

3.8.1 FPGA Implementations

FPGAs are a natural platform for BNNs when performing inference. BNNs take advan-

tage of bitwise operations when performing dot products. While CPUs and GPUs are capable

of performing these operations, they are optimized for a range of tasks, especially integer and

floating-point operations. FPGAs allow for custom data paths. They specifically allow for hard-

ware architectures optimized around the XNOR and popcount operations. FPGAs are generally

low-power platforms compared to CPUs, and especially GPUs. They are also usually physically smaller platforms than GPUs.
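To make the bitwise dot product concrete, the following Python sketch (our own illustration, not any particular FPGA design) packs two ±1 vectors into bit masks and computes their dot product with an XNOR and a popcount, checking the result against ordinary arithmetic.

    import numpy as np

    def pack(v):
        # Encode a {-1, +1} vector as an integer bit mask (+1 -> 1, -1 -> 0).
        bits = 0
        for i, val in enumerate(v):
            if val == 1:
                bits |= 1 << i
        return bits

    def xnor_popcount_dot(a_bits, b_bits, n):
        xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)   # bits set where the inputs match
        matches = bin(xnor).count("1")               # popcount
        return 2 * matches - n                       # matches minus mismatches

    n = 16
    a = np.where(np.random.rand(n) > 0.5, 1, -1)
    b = np.where(np.random.rand(n) > 0.5, 1, -1)
    assert xnor_popcount_dot(pack(a), pack(b), n) == int(np.dot(a, b))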


3.8.2 Architectures

FPGA DNN architectures usually fall under one of two categories, streaming architectures

and layer accelerators. Streaming architectures have dedicated hardware for all or most of the layers

in a network. These types of architectures can be pipelined, where each stage in the architecture

can be processing different input samples. This usually offers higher throughput, reasonable la-

tency, and requires less memory bandwidth. They do require more resources since all layers of the

network need dedicated hardware. These types of architectures are especially well suited for video

processing. This style is the most commonly found throughout the literature.

Layer accelerators provide modules that can handle only a specific layer of a network.

These modules need to be able to handle every type, size, and channel width of input that may be

required of it. Results are stored in memory to be fed back into the accelerator for the next layer

that will be processed. These types of architectures do not require as many resources as streaming

architectures but have much lower throughput. These types of architectures are well suited for

constrained resource designs where high throughput is not needed. The feedback nature of layer

processors also makes them well suited for RNNs, as seen in [95].

FPGAs typically include digital signal processors (DSPs) and block memories (BRAMs)

built into the logic fabric. DSPs can be vital in full precision DNN implementations on FPGAs

where they can be used to compute multi-bit dot products. In BNNs however, dot products are

bitwise operations and DSPs are not used as extensively. Nakahara et al. [96] show the effective-

ness of in-fabric calculation in BNNs over methods that use DSPs. BRAMs are used in BNNs to

store activations, weights, and other parameters. They offer storage for sliding windows used in

convolutions.

CPU-FPGA hybrid systems offer a CPU and FPGA connected in the same silicon. These

systems are widely used to implement DNNs and BNNs [62,65,66,68,80,87–89,94,97]. The CPU

is flexible and easily programmed to load inputs to the DNN. The FPGA can then be programmed

to execute the BNN architecture without extra logic for input and output processing.


3.8.3 High-Level Synthesis

FPGAs can be difficult to program for those who do not have specialized training. To help

software programmers without experience with hardware design, tool flows have been designed

where programmers can write a program in a language like C++ which is then synthesized into an

FPGA hardware design. This type of workflow is called High-Level Synthesis (HLS). HLS has

been a major component of the research done with BNNs on FPGAs [62, 65–67, 80, 82, 89, 94, 95,

98].

Yaman Umuroglu et al., from the Xilinx Research Labs, provided a specific workflow

designed for training and implementing BNNs called FINN [68]. Training of a BNN is done with

a deep learning library. The trained model is then used by FINN to produce code for the BNN

which it synthesizes into a hardware design by Xilinx’s HLS tool. The FINN tool received an

extension allowing it to work with BNN topologies for LSTMs [95]. Xilinx Research Labs also

extended the capabilities of FINN by allowing it to work with multi-bit quantized networks, not

just with BNNs [80].

3.8.4 Comparison of FPGA Implementations

We provide a comparison of BNN implementations in FPGA platforms in Table 3.12.

Details regarding accuracies, topologies, FPGA usage, and FPGA execution are given. Note

that [62, 89] were the only works that reported significant DSP usage, so DSP usage was left

out of Table 3.12.

3.8.5 ASICs

While FPGAs are well suited for processing BNNs and take advantage of their efficient

bitwise operations, custom silicon designs, or ASICs, have the potential to provide the ultimate

power and computational efficiency for any hardware design. FPGA fabrics can be configured

for BNN topologies, but the physical layout of FPGAs never changes. The fabric and resources

are made to fit a wide variety of applications. Hardware layout in ASIC designs can be changed

to fit the specifications for BNNs. Bitwise operations can be even more efficient in ASIC designs

than they are in any other platform [70, 71, 83, 90, 99–101]. ASIC designs can integrate image


Table 3.12: Comparison of FPGA implementations. The accuracies reported from [94, 96] were not explicitly stated. These numbers were inferred from the figures provided. The accuracy for [94] is assumed to be a top-5 accuracy and the accuracy for [62] is assumed to be a top-1 accuracy, but this was never stated by their respective authors. Datasets: MNIST = MN, SVHN = SV, CIFAR-10 = CI, ImageNet = IN.

Source   Dataset   Acc. (%)   Topology   FPGA   LUTs   BRAMs   Clk (MHz)   FPS   Power (W)

sensors [102] and other peripheral elements into their design for fast processing and low latency

access.

Nurvitadhi et al., from Intel's Accelerator Architecture Lab, designed a layer accelerator for

BNNs in an ASIC [99]. They compare the execution performance of the ASIC implementations

with implementations in an FPGA, CPU, and GPU. They show that power can be significantly

lower in an ASIC while maintaining the fastest execution times.

Since BNNs, like most DNNs, require a large number of parameters, a handful of ASIC designs focus on in-memory computations [94, 103–106]. Custom silicon also allows for the use of

mixed technologies and memory-based designs in resistive RAM (RRAM). RRAM is an emerging

technology and is an appealing platform for BNN designs due to its compact operation at the bit

level [85, 86, 103, 107, 108].

3.9 Conclusions

BNNs can provide substantial model compression and inference speedups over traditional

DNNs. BNNs do not achieve the same accuracy as their full precision counterparts, but improve-

ments have been made to close this gap. BNNs appear to be better replacements for smaller

networks rather than larger ones, coming within 4.3% top-1 accuracy for the small ResNet18 but

6.0% top-1 accuracy on the larger ResNet50.

The use of multiple bases, learned gain terms, learned bias terms, intelligent padding, and

even partial binarization have helped to make BNNs accurate while still executing at high speeds and maintaining small model sizes. These speeds have been accelerated even further as BNNs have been implemented in FPGAs and ASICs. New tool flows like FINN have made programming BNNs on FPGAs accessible to more designers.


CHAPTER 4. USING FULL PRECISION METHODS TO SCALE DOWN BINARIZED NEURAL NETWORKS

Most of the developments made to BNNs focus on scaling up BNNs to handle large and

complex image classification datasets. However, even the smallest BNNs proposed in the literature

do not fit on stand-alone mid-sized FPGAs [67]. Many proposed improvements to BNNs move

away from hardware and FPGA-friendly operations [58].

This chapter and Chapter 5 explore techniques for scaling down BNNs in order to im-

plement them on resource-limited systems like embedded computers and midsized FPGAs. This

chapter explores methods that already exist for full-precision networks, which we apply to BNNs.

Chapter 5 introduces a novel method specifically designed for BNNs. We find that the methods

explored in this chapter are not particularly effective for scaling down BNNs. We hypothesize that

it is necessary to use methods specifically designed for binary values in order to effectively scale

down BNNs, which motivates our work in Chapter 5.

4.1 Depthwise Separable Convolutions

Depthwise separable convolutional layers were first introduced in [29]. They reduce the

number of required operations by convolving each of the input channels of the convolutional layers

with 2D kernels and then combining them with 1×1 point-wise filters.

Standard convolutional layers use a 3D filter of size K×K×M, where K is the width and height of the kernel and M is the number of input channels, as shown in Figure 4.1. N number

of these filters are used to produce N output channels. Depthwise separable convolutional layers

use M number of K×K 2D kernels. Each of these kernels is applied to a single input channel. N

outputs channels are formed by combining these convolutions with N 1×1×M pointwise filters,

shown in Figure 4.2.
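The following Keras sketch (our own illustration; full-precision layers stand in for the binarized versions tested in this chapter and the layer sizes are arbitrary) shows the two layer types side by side and the reduction in parameter count that the depthwise separable form provides.

    import tensorflow as tf
    from tensorflow.keras import layers

    K, M, N = 3, 16, 32   # kernel size, input channels, output channels

    # Standard convolution: N filters of size K x K x M.
    standard = layers.Conv2D(filters=N, kernel_size=K, padding="same")

    # Depthwise separable convolution: one K x K kernel per input channel,
    # followed by N pointwise (1 x 1 x M) filters that mix the channels.
    separable = tf.keras.Sequential([
        layers.DepthwiseConv2D(kernel_size=K, padding="same"),
        layers.Conv2D(filters=N, kernel_size=1, padding="same"),
    ])

    x = tf.random.normal((1, 32, 32, M))
    print(standard(x).shape, separable(x).shape)               # both (1, 32, 32, N)
    print(standard.count_params(), separable.count_params())   # 4640 vs. 704 weights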


Figure 4.1: Standard convolutional layers use N number of K×K×M filters, where M is the number of input channels and N is the number of output channels.

Figure 4.2: Depthwise separable convolutional layers use M number of K×K 2D filters, one for each input channel, and combine them with N number of pointwise filters with a depth of M.


4.1.1 Experiments and Results

We tested models that use binarized depthwise separable convolutional layers on the CIFAR-

10 dataset. We constructed models using a VGG style topology, which is common through-

out the literature. The models had the following structure: C128-BN-C128-MP-BN-C256-BN-

C256-MP-BN-C512-BN-C512-MP-BN-FC1024-BN-FC1024-BN-FC10-SM. C represents a con-

volutional layer followed by the number of filters, BN represents a Batch Norm layer, MP repre-

sents 2× 2 max-pooling layers with a stride of 2, FC represents fully connected layers followed

by the number of units, and SM represents the softmax function. The topology is shown in Figure

4.3. We tested the depthwise separable layers against standard BNN convolutional layers. We also

tested models with convolutional layers where only the pointwise convolutional kernel or 3× 3

kernels were binarized.

We saw a significant drop in accuracy when using depthwise separable binarized convolu-

tional layers compared to standard convolutional layers. Figure 4.4 shows our test results. Using

separable filters decreases the accuracy from 80% to 70% and using real values in either the point-

wise or depthwise parts of the separable filter does not offer an increase in accuracy.

4.1.2 Discussion

The degradation in accuracy we see when using the depthwise separable filters is too much

to recommend the use of depthwise separable filters in BNNs. Yang et al. performed similar

experiments for object tracking tasks but did not show any significant advantages when using

depthwise separable BNNs [109]. Our experiments showed that it is essential to use real values in either the pointwise or depthwise filters. Using standard fully binarized BNN convolutional layers offers just as much accuracy as only binarizing either the depthwise or pointwise components of

the separable filters and uses only binary values. We do not recommend using binarized depthwise

separable filters.

4.2 Direct Skip Connections in BNNs

Direct skip connections were first introduced by Huang et al. as part of the DenseNet

model [14]. These types of skip connections are different from the residual skip connections


Figure 4.3: The topology used to test BNN models with depthwise separable filters. The same topology was used for both the models with standard convolutional layers and depthwise separable layers.


Figure 4.4: The classification accuracy of depthwise separable filters on the CIFAR-10 dataset. Standard BNNs were tested for comparison. We also tested separable filters where only the depthwise or pointwise parts of the separable filter were binarized. The accuracy is significantly decreased when using separable filters, and using real values in either separable component does not increase accuracy over standard BNNs.

introduced in the ResNet model [110]. Instead of adding a previous layer’s outputs to a downstream

layer’s inputs, dense skip connections concatenate a previous layer’s outputs as extra channels to

a downstream layer's inputs. In addition, direct skip connections connect every layer to all layers

that follow, as shown in Figure 4.5. This allows for gradients to pass freely from every output

layer to every earlier layer. This can help boost the strength of gradient signals to all layers during backpropagation; weak gradient flow is a known weakness of BNNs.
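A short Keras sketch of a block with direct skip connections (our own illustration; full-precision convolutions and a ReLU stand in for the binarized layers and sign activations used in our experiments, and the layer sizes are arbitrary): every convolution receives the concatenation of the block input and all previous outputs.

    import tensorflow as tf
    from tensorflow.keras import layers

    def direct_skip_block(x, num_layers=3, channels=16):
        outputs = [x]
        for _ in range(num_layers):
            # Concatenate the block input and every previous output as channels.
            inp = layers.Concatenate()(outputs) if len(outputs) > 1 else outputs[0]
            y = layers.Conv2D(channels, 3, padding="same")(inp)
            y = layers.BatchNormalization()(y)
            y = layers.Activation("relu")(y)   # stand-in for the sign activation
            outputs.append(y)
        return layers.Concatenate()(outputs)

    inputs = tf.keras.Input(shape=(32, 32, 3))
    model = tf.keras.Model(inputs, direct_skip_block(inputs))
    model.summary()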

4.2.1 Experiments and Results

We constructed a BNN model that used direct skip connections. We also built a standard

BNN model with the same number of weights as the proposed BNN model. Figure 4.5 illustrates

the proposed connections in contrast to those of standard BNNs.

These connections did not seem to improve large BNNs much. However, they appeared

to effectively allow for channel reduction and downscaling. Figure 4.6 compares classification test accuracy on the CIFAR-10 dataset for a standard BNN topology and for our proposed method using direct skip connections. As the number of base channels in the convolutional layers

was reduced, the accuracy of models with skip connections held up better than that of models without.

We hypothesize that the skip connections compensate for the limited gradient flow imposed by the

binarization of the network and the reduced number of channels. With direct skip connections,

the gradient information can flow to all layers of the network, passing through fewer binarization

operations.

Figure 4.5: An illustration of standard feedforward connections (left) and direct skip connections (right). Direct skip connections concatenate the outputs of previous convolutions to form the input to the next layer of convolution. Including these connections helps BNNs with a small number of weights.


Figure 4.6: Plots of the test accuracy during training on the CIFAR-10 dataset. Models with skip connections were compared to models without. Each plot is labeled by the number of base convolution channels. As the number of convolution channels decreases, models with skip connections perform better than the models without skip connections.

4.2.2 Discussion

Our experiments with dense connections in BNNs seemed to show us that the number of

weights and operations in BNNs can be reduced without sacrificing much accuracy. It seemed that

when direct skip connections were present, reducing the number of parameters did not reduce the accuracy of the network as much as it did in BNNs without direct skip connections. However, there is a

drawback to using these types of connections. The activations from early layers in the network need to be stored in order to be reused as inputs to every other layer in the network. This increases the memory requirements for the network. Dense skip connections can be used to increase accuracy

on small BNNs, but since it comes with an increased memory cost, we do not propose dense skip

connections as an effective means for implementing BNNs on smaller FPGAs.


CHAPTER 5. NEURAL JET FEATURES

Due to the bitwise nature of BNNs, there have been many efforts to implement BNNs on

ASICs and FPGAs [62,65,66], as reviewed in Chapter 3. While BNNs are excellent candidates for

these kinds of resource-limited systems, most implementations still require very large FPGAs or

CPU-FPGA co-processing systems [80, 94]. This chapter focuses on reducing the computational

cost of BNNs even further, making them more efficient to implement on resource-limited systems.

We target embedded visual inspection tasks, like defect detection on manufactured parts and the

sorting of agricultural produce. We propose a new binarized convolutional layer, called the Neural

Jet Features layer. This layer uses deep learning to learn essential classic computer vision kernels

that are efficient to calculate as a group. We show that on visual inspection tasks, Neural Jet

Features perform comparably to standard BNN convolutional layers while using fewer computational resources. We also show that Neural Jet Features tend to be more stable than BNN convolutional

layers when training small models.

5.1 Introduction

In Chapter 2 we presented Jet Features [111], a set of convolution kernels that are efficient

to compute on resource-limited systems, which utilize the key scaling and partial derivative prop-

erties of classic image processing kernels. We demonstrated their effectiveness by replacing the

set of traditional image features used in the ECO Features algorithm [19] with Jet Features, which

allowed the algorithm to be effectively implemented on an FPGA for high-speed, low-power clas-

sification without sacrificing image classification accuracy.

In this chapter, we propose a new binarized convolutional layer called the Neural Jet Fea-

tures layer. The Neural Jet Features layer is a convolutional layer that is trained to form Jet Fea-

tures within a BNN using DL methods. This creates a BNN model that is less costly to compute


on resource-limited systems while maintaining comparable classification accuracy. We also show

that Neural Jet Features are more stable during training on the MNIST dataset.

5.2 Neural Jet Features

Neural Jet Features are Jet Features that are learned through DL and used in place of stan-

dard binarized convolutional filters. Neural Jet Features require fewer parameters and fewer opera-

tions than the traditional convolutional layers used in BNNs. They learn essential classic computer

vision kernels, combined through deep learning. It is not possible for standard BNN convolutional

layers to learn these classic computer vision features. The results in Section 5.3 show that BNNs

using Neural Jet Features achieve comparable accuracy on certain image classification tasks com-

pared to BNNs using binary conventional layers. Convolutions typically account for the majority

of operations in a BNN, and by using fewer operations to perform convolution, Neural Jet Features

allow BNNs to fit into resource-limited systems, like embedded computers and FPGAs.

The small 2× 1 kernels that make up a Jet Feature, as shown in Figure 5.1, differ only

in orientation and whether their second element contains a -1 or 1. A 3× 3 Jet Kernel is formed

from four of these smaller kernels, thus four binary weights need to be selected to form a 3×3 Jet

Feature.
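As a concrete illustration (our own NumPy sketch), two of the 2×1 building blocks convolved together give a 3-element 1-D kernel, and the outer product of a vertical and a horizontal 1-D kernel gives a 3×3 Jet kernel; this particular choice of the four binary weights yields a 3×3 Sobel kernel.

    import numpy as np

    # The two 2x1 building blocks: a scaling factor and a partial derivative.
    scale = np.array([1, 1])
    deriv = np.array([1, -1])

    vert = np.convolve(scale, scale)    # [1, 2, 1]  (smoothing)
    horz = np.convolve(scale, deriv)    # [1, 0, -1] (derivative)
    sobel = np.outer(vert, horz)        # 3x3 Sobel kernel

    print(sobel)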

Figure 5.1: The 2×1 kernels that are used to form Jet Features.


5.2.1 Constrained Neural Jet Features

In Chapter 2, we observed that the genetic algorithm selected scaling factors ([1,1]) more

frequently than partial derivatives ([1,−1]) when forming Jet Features. Based on this observation,

we experimented with a constrained version of Neural Jet Features where one of the vertical pa-

rameters and one of the horizontal parameters was forced to be 1, forming scaling factors more

often than partial derivatives. This reduced the computational cost of Neural Jet Features (Section

5.2.3) while increasing the average accuracy in some of our testing compared to unconstrained

Neural Jet Features (Section 5.3). With only two binary parameters to be learned per kernel, there

were only four possible kernels, the Gaussian kernel, vertical and horizontal Sobel kernels, and a

diagonally oriented edge detector, as shown in Figure 5.2. The constrained version of Neural Jet

Features was more efficient than the unconstrained version with comparable or better accuracy.

The constrained form of Neural Jet Features is the proposed form of Neural Jet Features.

5.2.2 Jet Features

5.2.3 Computational Efficiency

Neural Jet Features learn how to combine input channels using these four classical com-

puter vision kernels, shown in Figure 5.2. These kernels have been essential to traditional computer

vision and are often used multiple times within a single algorithm [2, 3]. Since there are only four

possible kernels to be learned, it may make more sense to view a Neural Jet Feature layer as a layer

that learns to combine these four features with the four features of all the other input layers. Even

though there are only four possible features to learn for each input channel, there are 4^N unique

ways in which to combine the features of the input channels to form a single output channel, where

N is the number of input channels.

Like traditional convolutional layers, Neural Jet Features reuse weights in ways that em-

ulate traditional computer vision and image processing. Instead of learning separate weights for

every combination of input and output pixel, as done in fully connected layers, convolutional layers

form weights into kernels that are applied locally and reused throughout the entirety of the input

image, greatly reducing the number of weights to be learned. Similarly, Neural Jet Features also

reuse weights. Neural Jet Features do not learn unique kernels for each and every combination of


Figure 5.2: The classic computer vision kernels that can be constructed with "constrained" Neural Jet Features (see Section 5.2.1).

input and output channels. Instead, all four 3× 3 Jet Features are applied to each input channel

and then reused as needed to form the output channels. This reduces the number of operations

required, especially when there are more than just a few output channels.

All four Jet Features are made up of similar 2× 1 convolutions, as shown in Figure 2.3.

Since all four Jet Features are always calculated for every input channel (and reused as needed),

these convolutions can be effectively calculated as a group. The smaller 2× 1 convolutions that

are common to multiple features can be calculated first and reused to calculate the larger 3×3 Jet

Features. Figure 5.3 shows how these 2×1 kernels can be applied in an effective manner to reduce

the number of operations that are needed. By contrast, four arbitrary 3× 3 binary convolutions


are not guaranteed to share any common operations, thus they must be calculated independently of

each other.

Both of these aspects, kernel reuse and common 2×1 convolutions, make Neural Jet Fea-

tures computationally efficient compared to standard binary convolutions. These bitwise efficien-

cies are particularly well suited for FPGA and ASIC implementations where data paths are de-

signed at a bitwise level of granularity.

We illustrated the computational efficiency of Neural Jet Features with a potential FPGA

hardware design, shown in Figure 5.3. The top diagram shows a typical multiply-accumulate

operation for an input channel in a BNN. The multiplication operations are simply bitwise XNOR

operations. The addition operations are organized into a tree structure to reduce the resources

needed. One accumulation tree is required for every output channel. By contrast, the number of

accumulation operations in a Neural Jet Feature layer does not increase with added output channels.

All features are calculated and reused as needed to form the output channels. The addition and

subtraction units in the bottom two diagrams are the same units shown in Figure 2.10 in Chapter 2.

To form a rough comparison of the computational cost of each of the options shown in Fig-

ure 5.3, we assign a cost of 1, 2, 3, or 4 units to each of the operations depending on the number of

bits in their operands. We assign a cost of 2 to the final addition of the standard BNN convolutional

layer which has a 4-bit operand and a 1-bit operand. The standard BNN would cost 13 units per

output channel. For the unconstrained Neural Jet Features the cost would be 61 units to produce

all 9 features. The constrained version would only cost 27 units for all possible Jet Features. In

layers with 3 or more output channels, calculating all of the constrained Jet Features is less expen-

sive than standard BNN convolutions. For example, in a layer with 64 output channels, a standard

BNN layer would cost 832 units (64× 13) while a constrained Neural Jet Feature layer would

cost only 27 units, since the features would be reused as needed. The number of accumulation re-

sources does not scale with the number of output channels as it does in standard BNN convolutional

layers. We note that these comparisons are hypothetical and as part of our future work we plan to

implement standard BNN convolutional layers and Neural Jet Feature layers in an FPGA to more

accurately demonstrate their computational efficiency.
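The reuse idea can be sketched directly in NumPy (our own simplified illustration; the selection indices, the plain summation used to combine features, and the use of SciPy's convolve2d are stand-ins for the binarized combination a real layer would learn): all four constrained Jet feature maps are computed once per input channel and then reused for every output channel.

    import numpy as np
    from scipy.signal import convolve2d

    # The four constrained Jet kernels, built from the 2x1 building blocks.
    s, d = np.array([1, 1]), np.array([1, -1])
    kernels = [np.outer(np.convolve(s, s), np.convolve(s, s)),   # Gaussian
               np.outer(np.convolve(s, s), np.convolve(s, d)),   # Sobel (x-derivative)
               np.outer(np.convolve(s, d), np.convolve(s, s)),   # Sobel (y-derivative)
               np.outer(np.convolve(s, d), np.convolve(s, d))]   # diagonal edge detector

    def jet_layer(inputs, selections):
        # inputs: list of 2-D channels. selections[o][c] picks which of the
        # four Jet features of input channel c feeds output channel o.
        features = [[convolve2d(ch, k, mode="same") for k in kernels] for ch in inputs]
        return [sum(features[c][sel] for c, sel in enumerate(row)) for row in selections]

    x = [np.random.randn(8, 8) for _ in range(2)]                 # two input channels
    outs = jet_layer(x, selections=[[0, 1], [2, 3], [1, 1]])      # three output channels
    print(len(outs), outs[0].shape)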


Figure 5.3: An example of how the operations in Neural Jet Features can be arranged. The number of operations to calculate all features in a standard BNN convolutional layer scales with the number of output channels. For Neural Jet Feature layers, the number of operations to calculate the features is the same regardless of how many output channels there are.


5.3 Results

We tested Neural Jet Features on datasets that are representative of visual inspection tasks

where the images are of a somewhat consistent scale and/or orientation. We used the BYU Fish

dataset [111], BYU Cookie dataset (Section 5.3.3), and the MNIST dataset [10]. The MNIST

dataset has a bit more variety, which is not typical of visual inspection datasets, but it does lend

insight into how Jet Features compare on a widely used dataset.

5.3.1 Model Topology

We experimented with three different types of convolutional layers: standard binarized

convolutional layers [28], unconstrained Neural Jet Feature layers, and constrained Neural Jet

Feature layers (see Section 5.2.1). For all experiments, we used a similar VGG style model topol-

ogy: Conv-BN- Conv-MP-BN-Conv-BN-Conv-MP-BN-FC-BN-FC-BN-SM (Figure 5.4), where

Conv represents convolutional layers, BN represents batch normalization layers, MP represents

max pool layers, FC represents fully connected layers and SM represents a softmax activation

with a window of 2× 2 and stride of 2, FC represents fully connected layers where the first one

contains the number of nodes specified by the experiment and the second one contains the same

number of nodes as there are different classes in the dataset being used. The number of filters in

the convolutional layers and the number of neurons in the first fully connected layer is specified

by the experiment, shown in Table 5.1. We note that these models are smaller than most BNNs

used throughout the literature [112]. All convolutional layers used the same number of filters for

a given experiment. The activation function was the binarization operation that takes place after

the batch normalization layers, except after the final batch normalization layer where a softmax

function was used. The inputs were not binarized, similar to other BNNs throughout the literature.
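For reference, this topology can be sketched in Keras as follows (our own illustration; full-precision layers, assumed 3×3 kernels, and a final softmax stand in for the binarized convolutions, Neural Jet Feature layers, and sign activations actually used in the experiments).

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_model(num_filters, fc_units, num_classes, input_shape):
        inputs = tf.keras.Input(shape=input_shape)
        x = layers.Conv2D(num_filters, 3, padding="same")(inputs)
        x = layers.BatchNormalization()(x)
        x = layers.Conv2D(num_filters, 3, padding="same")(x)
        x = layers.MaxPooling2D(2, 2)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Conv2D(num_filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Conv2D(num_filters, 3, padding="same")(x)
        x = layers.MaxPooling2D(2, 2)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Flatten()(x)                # flatten before the FC layers
        x = layers.Dense(fc_units)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Dense(num_classes)(x)
        x = layers.BatchNormalization()(x)
        outputs = layers.Activation("softmax")(x)
        return tf.keras.Model(inputs, outputs)

    # MNIST-sized configuration from Table 5.1 (16 filters, 32 dense units).
    model = build_model(num_filters=16, fc_units=32, num_classes=10,
                        input_shape=(28, 28, 1))
    model.summary()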

5.3.2 BYU Fish Dataset

The BYU Fish dataset consists of images of eight different fish species, all oriented the

same way, 161 pixels wide and 46 pixels high. Examples are shown in Figure 2.14 in Chapter 2.

The model used when testing the BYU Fish dataset had 8 filters in the convolutional layers and 16

neurons in the fully connected layer.


Figure 5.4: The model topology used for all experiments. The number of filters in the convolutional layers and the number of nodes in the first fully connected layer change per experiment.

Table 5.1: The layer sizes used for each dataset.

Dataset Conv. Filters Fully Connected Units

BYU Fish 8 16

BYU Cookie 8 8

MNIST 16 32

MNIST 32 128

MNIST 64 256


From the results shown in Figure 5.5, we see that the standard BNN convolutional layers

and the constrained Neural Jet Features performed similarly on the BYU Fish dataset, both reach-

ing an average accuracy of 99.8%. Unconstrained Neural Jet Features performed worse,

hovering around 95% accuracy, significantly worse than the constrained Neural Jet Features. A

similar pattern was shown with the BYU Cookie dataset as well, which we hypothesize is due

to the fact that the unconstrained Neural Jet Features are allowed to learn features that are not as

useful as the ones the constrained version learns.

Figure 5.5: Classification accuracy on the BYU Fish dataset

5.3.3 BYU Cookie Dataset

The BYU Cookie dataset includes images of sandwich-style cookies that are either in good

condition, offset, or broken, as seen in Figure 5.6. These images are 100×100 pixels in size. This

dataset is fairly small with 345 training images and 88 validation images.

The validation accuracy on the BYU Cookie dataset, shown in Figure 5.7, shows that the

normal BNN convolution and constrained Neural Jet Features outperform unconstrained Neural Jet


Figure 5.6: Examples from the BYU Cookie Dataset. The images in this dataset are 100×100 pixels from three classes: good (left), offset (middle), or broken (right).

Features. In addition, we saw that validation accuracy is sporadic over the course of training, which

is to be expected when dealing with a smaller dataset. The results seem to be more consistent when

using constrained Neural Jet Features than standard BNN convolutional layers, which can also be

observed in the results from the MNIST dataset.

Figure 5.7: Classification accuracy on the BYU Cookie dataset


5.3.4 MNIST Dataset

The MNIST dataset consists of 70,000 images of handwritten numbers [10], 28×28 pixels

in size. We trained three models of different sizes on this dataset: one model consisting of 16

filters and 32 fully connected units, one with 32 filters and 128 fully connected units, and another

with 64 filters and 256 fully connected units, which are smaller than other models trained on this

dataset [112]. Figures 5.8, 5.9, and 5.10 show the validation accuracy of these models, respectively.

The scale of the y-axis is kept consistent between these three figures in order to easily compare the

results between each of them. In the largest model, the average accuracy of all three layer types approached

99%.

On the smaller models, the normal BNN convolutions produce inconsistent results, shown

in Figures 5.8 and 5.9, and as seen on the BYU Cookie dataset. This demonstrates a known

difficulty in working with small BNNs. Switching binarized weights between the values -1 and 1

can have dramatic effects in local regions of the network during the learning process. By adding

more weights to a model, this effect is mitigated, as seen in Figure 5.10. Our experiments show

that Neural Jet Features are less susceptible to this effect, making them a good choice for smaller

BNN models. We postulate that this is due to the fact that Neural Jet Feature convolutional layers

are limited to the classic computer vision kernels shown in Figure 5.2.

5.4 Conclusion

We have presented Neural Jet Features, a binarized convolutional layer that combines the

power of deep learning with classic computer vision kernels. Not only do Neural Jet Features

require fewer operations than standard convolutional layers in BNNs, but they are also more stable in

smaller models that would be used for visual inspection tasks. Neural Jet Features have comparable

accuracy on visual inspection datasets while requiring fewer operations and parameters. Neural Jet

Features offer an efficient solution for resource-limited systems that can take advantage of their

bitwise efficiency, like ASIC and FPGA designs.


Figure 5.8: Classification accuracy on the MNIST dataset with 16 filters and 32 dense units.

Figure 5.9: Classification accuracy on the MNIST dataset with 32 filters and 128 dense units.


Figure 5.10: Classification accuracy on the MNIST dataset with 64 filters and 256 dense units.


CHAPTER 6. CONCLUSION

6.1 Discussion

Resource-limited systems, like embedded computers, FPGAs, and ASICs, are compact

and low power. They are ideal platforms for many industrial visual inspection tasks. However,

high-speed image classification algorithms usually require many floating-point operations and large

amounts of memory. Resource-limited systems are not capable of running mainstream image

classification algorithms at high speeds. Large amounts of floating-point values and operations are

not well suited for resource-limited systems. In this work, we have explored the use of binary-

valued algorithms for visual inspection tasks. Binary values are well suited for resource-limited

systems, especially FPGAs and ASICs where individual bits can be easily manipulated.

The current state-of-the-art image classification algorithms are deep learning (DL) models.

They rely on large amounts of floating-point values and operations. They can be trained without

any prior knowledge of the target domain. Traditional image classification algorithms are less ex-

pensive than DL models but require more expert knowledge of the target domain during training.

Our work looks at both traditional and DL image classification techniques. We combine the el-

egance of traditional image classification with the power of DL models to make binarized image

classification algorithms that are suitable for high-speed image classification on resource-limited

systems.

In Chapter 2, we looked at an existing image classification technique that uses traditional

computer vision. The ECO Features algorithm uses a grab bag of standard image processing func-

tions and combines them using a genetic algorithm. Not all image processing functions were

suitable for resource-limited systems. The functions that were selected by the genetic algorithm

most often were convolutional kernels. These particular kernels could all be derived from smaller

binary-valued convolution kernels. We replaced the original grab bag of image processing func-

tions with these binary-valued convolutions, which we call Jet Features. This reduced the compu-


tational and memory costs of the algorithm. It became much simpler to implement the algorithm

on an FPGA. Compared with the original ECO Features algorithm, we saw a 3.7× speed up on

an embedded system and 78× speed up on an FPGA over a full desktop system. Accuracy was

similar for both algorithms.

Chapter 3 reviewed Binarized Neural Networks (BNNs). These networks use binary values

for both the weights and activations. They are trained through deep learning. The original BNN

introduced by Courbariaux et al. was able to process small datasets like the MNIST dataset. Much

of the work on BNNs has focused on improving accuracy on more challenging datasets like Im-

ageNet. FPGA designers have also paid special attention to BNNs. There have been many works that implement BNNs on FPGAs. Most of these works require CPU-FPGA co-processing systems or

large stand-alone FPGAs.

Chapters 4 and 5 look at ways to scale down the size of BNNs to allow them to fit on

smaller, more affordable FPGAs. In Chapter 4, we explored some techniques that are used for

full precision systems and applied them to BNNs. However, they did not translate well to BNNs.

In Chapter 5, we introduced Neural Jet Features, which are Jet Features that are trained using the

same deep learning methods used in BNNs. Neural Jet Features were just as accurate as BNNs on

visual inspection tasks but only require a fraction of the number of operations. They are also more

stable when training smaller models.

6.2 Summary of Contributions

• Developed Jet Features:

– Use common building blocks that can create essential convolution kernels like the Gaussian and Sobel filters.

– Can be computed efficiently as a whole set, reusing common building blocks.

– Use binary values, eliminating the need for any multiplication or floating-point values.

– Can be implemented on an FPGA using only line buffers and addition/subtraction units.

• Designed a software implementation of the ECO Jet Features algorithm that achieved a 3.7× speedup over the original ECO Features algorithm.


• Implemented the ECO Jet Features algorithm on an FPGA, achieving a 78× speedup over the original ECO Features algorithm running on a full-sized desktop.

• Surveyed the BNN literature and compared the most prominent methods, weighing their strengths and weaknesses.

• Compared the various FPGA implementations of BNNs found throughout the literature.

• Adopted full-scale deep learning techniques for BNNs in order to reduce the size of BNN models.

• Developed the Neural Jet Feature Layer to replace the convolutional layers in BNNs and compared it against standard BNNs.

– Allows the network to reuse filter outputs in order to reduce the required number of operations and memory space.

– Computes multiple features as a group.

– Performs just as well as, if not better than, standard BNNs on visual inspection tasks while using only a fraction of the required number of operations.

6.3 Future Work

The Neural Jet Features algorithm needs to be explored further. We plan on implementing Neural Jet Feature layers on an FPGA, which will give us more insight into how the method can be further developed to reduce the computational and memory costs of the algorithm. We have not yet explored reusing Jet Features between layers. This could be done through skip connections, but ones that carry only individual layer features. Doing so may reduce the memory costs that were discussed in Section 4.2.2.

We have not yet explored mixing Neural Jet Features with standard BNN convolutional layers. Neural Jet Features use the same classic kernels that are used in the image processing portions of traditional image classification systems (see Figure 1.1). It may be advantageous to use Neural Jet Features as a front-end image processing stage, followed by standard BNN convolutional layers and a fully connected back end. This may allow the algorithm to perform well on more complex datasets while still reducing the number of operations and the memory space requirements.


REFERENCES

[1] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L., 2015. "ImageNet Large Scale Visual Recognition Challenge." International Journal of Computer Vision (IJCV), 115(3), pp. 211–252. 3, 4, 6

[2] Lowe, D. G., 2004. "Distinctive image features from scale-invariant keypoints." International Journal of Computer Vision, 60(2), Nov, pp. 91–110. 3, 77

[3] Bay, H., Tuytelaars, T., and Van Gool, L., 2006. "Surf: Speeded up robust features." In Computer Vision – ECCV 2006, A. Leonardis, H. Bischof, and A. Pinz, eds., Springer Berlin Heidelberg, pp. 404–417. 3, 77

[4] Rublee, E., Rabaud, V., Konolige, K., and Bradski, G., 2011. "Orb: An efficient alternative to sift or surf." In 2011 International Conference on Computer Vision, pp. 2564–2571. 3

[5] Cortes, C., and Vapnik, V., 1995. "Support-vector networks." In Machine Learning, pp. 273–297. 3

[6] Kotsiantis, S. B., 2013. "Decision trees: a recent overview." Artificial Intelligence Review, 39(4), Apr, pp. 261–283. 3

[7] Csurka, G., Dance, C. R., Fan, L., Willamowski, J., and Bray, C., 2004. "Visual categorization with bags of keypoints." In Workshop on Statistical Learning in Computer Vision, ECCV, pp. 1–22. 3

[8] Guo, G., Wang, H., Bell, D., Bi, Y., and Greer, K., 2003. "Knn model-based approach in classification." In On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE, R. Meersman, Z. Tari, and D. C. Schmidt, eds., Springer Berlin Heidelberg, pp. 986–996. 3

[9] MacQueen, J., 1967. "Some methods for classification and analysis of multivariate observations." In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, University of California Press, pp. 281–297. 3

[10] Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P., 1998. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11), Nov, pp. 2278–2324. 4, 81, 85

[11] Krizhevsky, A., Sutskever, I., and Hinton, G. E., 2012. "Imagenet classification with deep convolutional neural networks." In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS'12, Curran Associates Inc., pp. 1097–1105. 4, 12


[12] Simonyan, K., and Zisserman, A., 2014. "Very deep convolutional networks for large-scale image recognition." arXiv 1409.1556, 09. 4

[13] He, K., Gkioxari, G., Dollar, P., and Girshick, R. B., 2017. "Mask R-CNN." CoRR, abs/1703.06870. 4

[14] Huang, G., Liu, Z., Weinberger, K. Q., and van der Maaten, L., 2016. "Densely connected convolutional networks." 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269. 4, 69

[15] Xie, S., Girshick, R., Dollar, P., Tu, Z., and He, K., 2017. "Aggregated residual transformations for deep neural networks." pp. 5987–5995. 4

[16] Lillywhite, K., Lee, D.-J., Tippetts, B., and Archibald, J., 2013. "A feature construction method for general object recognition." Pattern Recognition, 46(12), pp. 3300–3314. 4, 9, 12, 13, 16

[17] Florack, L., Ter Haar Romeny, B., Viergever, M., and Koenderink, J., 1996. "The gaussian scale-space paradigm and the multiscale local jet." Int. J. Comput. Vision, 18(1), apr, pp. 61–75. 5, 10, 12

[18] Kim, M., and Smaragdis, P., 2018. "Bitwise Neural Networks for Efficient Single-Channel Source Separation." In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 2018-April, IEEE, pp. 701–705. 5, 56

[19] Lillywhite, K., Tippetts, B., and Lee, D.-J., 2012. "Self-tuned evolution-constructed features for general object recognition." Pattern Recognition, 45(1), pp. 241–251. 8, 9, 12, 16, 75

[20] Lillholm, M., and Pedersen, K. S., 2004. "Jet based feature classification." Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., 2, pp. 787–790, Vol. 2. 12

[21] Larsen, A. B. L., Darkner, S., Dahl, A. L., and Pedersen, K. S., 2012. "Jet-based local image descriptors." In Computer Vision – ECCV 2012, A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid, eds., Springer Berlin Heidelberg, pp. 638–650. 12

[22] Manzanera, A., 2011. "Local jet feature space framework for image processing and representation." In 2011 Seventh International Conference on Signal Image Technology Internet-Based Systems, pp. 261–268. 12

[23] Breiman, L., 2001. "Random forests." Machine Learning, 45(1), Oct, pp. 5–32. 13

[24] Zhu, J., Rosset, S., Zou, H., and Hastie, T., 2006. "Multi-class adaboost." Statistics and its interface, 2, 02. 15

[25] Freund, Y., and Schapire, R. E., 1997. "A decision-theoretic generalization of on-line learning and an application to boosting." J. Comput. Syst. Sci., 55(1), aug, pp. 119–139. 15

[26] Zhang, M., Lee, D.-J., Lillywhite, K., and Tippetts, B., 2017. "Automatic quality and moisture evaluations using evolution constructed features." Computers and Electronics in Agriculture, 135, pp. 321–327. 16


[27] Prost-Boucle, A., Bourge, A., Petrot, F., Alemdar, H., Caldwell, N., and Leroy, V., 2017. "Scalable high-performance architecture for convolutional ternary neural networks on fpga." In 2017 27th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–7. 29

[28] Courbariaux, M., and Bengio, Y., 2016. "Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1." CoRR, abs/1602.02830. 33, 37, 38, 39, 41, 42, 43, 44, 49, 53, 56, 57, 59, 60, 61, 81

[29] Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H., 2017. "Mobilenets: Efficient convolutional neural networks for mobile vision applications." CoRR, abs/1704.04861. 35, 67

[30] Jaderberg, M., Vedaldi, A., and Zisserman, A., 2014. "Speeding up convolutional neural networks with low rank expansions." CoRR, abs/1405.3866. 35

[31] Chen, Y., Wang, N., and Zhang, Z., 2017. "Darkrank: Accelerating deep metric learning via cross sample similarities transfer." CoRR, abs/1707.01220. 35

[32] Iandola, F. N., Moskewicz, M. W., Ashraf, K., Han, S., Dally, W. J., and Keutzer, K., 2016. "Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size." CoRR, abs/1602.07360. 35

[33] Hanson, S. J., and Pratt, L., 1989. "Advances in neural information processing systems 1." Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ch. Comparing Biases for Minimal Network Construction with Back-propagation, pp. 177–185. 35

[34] Cun, Y. L., Denker, J. S., and Solla, S. A., 1990. "Advances in neural information processing systems 2." Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ch. Optimal Brain Damage, pp. 598–605. 35

[35] Han, S., Mao, H., and Dally, W. J., 2016. "Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding." CoRR, abs/1510.00149. 35

[36] Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan, P., 2015. "Deep Learning with Limited Numerical Precision." 36

[37] Courbariaux, M., Bengio, Y., and David, J.-P., 2014. "Training deep neural networks with low precision multiplications." pp. 1–10. 36

[38] Zhou, S., Ni, Z., Zhou, X., Wen, H., Wu, Y., and Zou, Y., 2016. "Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients." CoRR, abs/1606.06160. 36, 44, 45, 51, 54, 60, 62

[39] Seo, J., Yu, J., Lee, J., and Choi, K., 2016. "A new approach to binarizing neural networks." In 2016 International SoC Design Conference (ISOCC), IEEE, pp. 77–78. 37, 60


[40] Yonekawa, H., Sato, S., and Nakahara, H., 2018. "A Ternary Weight Binary Input Convolutional Neural Network: Realization on the Embedded Processor." In 2018 IEEE 48th International Symposium on Multiple-Valued Logic (ISMVL), Vol. 2018-May, IEEE, pp. 174–179. 37, 61

[41] Hwang, K., and Sung, W., 2014. "Fixed-point feedforward deep neural network design using weights +1, 0, and -1." IEEE Workshop on Signal Processing Systems, SiPS: Design and Implementation, pp. 1–6. 37, 60

[42] Prost-Boucle, A., Bourge, A., and Petrot, F., 2018. "High-Efficiency Convolutional Ternary Neural Networks with Custom Adder Trees and Weight Compression." ACM Transactions on Reconfigurable Technology and Systems, 11(3), dec, pp. 1–24. 37, 60, 61

[43] Saad, D., and Marom, E., 1990. "Training feed forward nets with binary weights via a modified chir algorithm." Complex Systems, 4, 01. 37

[44] Baldassi, C., Braunstein, A., Brunel, N., and Zecchina, R., 2007. "Efficient supervised learning in networks with binary synapses." Proceedings of the National Academy of Sciences, 104(26), pp. 11079–11084. 37

[45] Soudry, D., Hubara, I., and Meir, R., 2014. "Expectation Backpropagation: parameter-free training of multilayer neural networks with real and discrete weights." Advances in Neural Information Processing Systems, 2(1), pp. 963–971. 37, 56

[46] Courbariaux, M., Bengio, Y., and David, J.-P., 2015. "BinaryConnect: Training Deep Neural Networks with binary weights during propagations." In NIPS, pp. 3123–3131. 37, 38, 43, 60, 61

[47] Wan, L., Zeiler, M., Zhang, S., Le Cun, Y., and Fergus, R., 2013. "DropConnect." International Conference on Machine Learning. 37

[48] Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y., 2016. "Binarized Neural Networks." In NIPS, pp. 1–9. 37, 38, 43, 44, 62

[49] Ding, R., Liu, Z., Shi, R., Marculescu, D., and Blanton, R. S., 2017. "LightNN." In Proceedings of the Great Lakes Symposium on VLSI 2017 - GLSVLSI '17, ACM Press, pp. 35–40. 37

[50] Ding, R., Liu, Z., Blanton, R. D. S., and Marculescu, D., 2018. "Lightening the Load with Highly Accurate Storage- and Energy-Efficient LightNNs." ACM Transactions on Reconfigurable Technology and Systems, 11(3), dec, pp. 1–24. 37

[51] Bengio, Y., Leonard, N., and Courville, A., 2013. "Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation." pp. 1–12. 38

[52] Lin, X., Zhao, C., and Pan, W., 2017. "Towards Accurate Binary Convolutional Neural Network." In NIPS. 42, 46, 51, 52, 62

[53] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R., 2014. "Intriguing properties of neural networks." CoRR, abs/1312.6199. 42


[54] Moosavi-Dezfooli, S., Fawzi, A., Fawzi, O., and Frossard, P., 2016. "Universal adversarial perturbations." CoRR, abs/1610.08401. 42

[55] Galloway, A., Taylor, G. W., and Moussa, M., 2017. "Attacking binarized neural networks." CoRR, abs/1711.00449. 42

[56] Khalil, E. B., Gupta, A., and Dilkina, B., 2018. "Combinatorial attacks on binarized neural networks." CoRR, abs/1810.03538. 42

[57] Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y., 2017. "Quantized neural networks: Training neural networks with low precision weights and activations." J. Mach. Learn. Res., 18(1), jan, pp. 6869–6898. 44, 62

[58] Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A., 2016. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks." In ECCV, pp. 525–542. 44, 51, 62, 67

[59] Kanemura, A., Sawada, H., Wakisaka, T., and Hano, H., 2017. "Experimental exploration of the performance of binary networks." In 2017 IEEE 2nd International Conference on Signal and Image Processing (ICSIP), Vol. 2017-Janua, IEEE, pp. 451–455. 44, 61

[60] Tang, W., Hua, G., and Wang, L., 2017. "How to Train a Compact Binary Neural Network with High Accuracy?" AAAI. 45, 46, 47, 51, 52, 53, 55, 62

[61] Darabi, S., Belbahri, M., Courbariaux, M., and Nia, V. P., 2018. "BNN+: Improved Binary Network Training." Seventh International Conference on Learning Representations, dec, pp. 1–10. 51, 53, 61, 62

[62] Ghasemzadeh, M., Samragh, M., and Koushanfar, F., 2018. "ReBNet: Residual Binarized Neural Network." In 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), IEEE, pp. 57–64. 52, 59, 60, 61, 62, 63, 64, 65, 75

[63] Prabhu, A., Batchu, V., Gajawada, R., Munagala, S. A., and Namboodiri, A., 2018. "Hybrid Binary Networks: Optimizing for Accuracy, Efficiency and Memory." In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Vol. 2018-Janua, IEEE, pp. 821–829. 53, 62

[64] Wang, H., Xu, Y., Ni, B., Zhuang, L., and Xu, H., 2018. "Flexible Network Binarization with Layer-Wise Priority." In 2018 25th IEEE International Conference on Image Processing (ICIP), IEEE, pp. 2346–2350. 53

[65] Zhao, R., Song, W., Zhang, W., Xing, T., Lin, J.-H., Srivastava, M., Gupta, R., and Zhang, Z., 2017. "Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs." In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '17, ACM Press, pp. 15–24. 54, 61, 63, 64, 65, 75


[66] Guo, P., Ma, H., Chen, R., Li, P., Xie, S., and Wang, D., 2018. "FBNA: A Fully Binarized Neural Network Accelerator." In 2018 28th International Conference on Field Programmable Logic and Applications (FPL), IEEE, pp. 51–513. 54, 60, 61, 63, 64, 65, 75

[67] Fraser, N. J., Umuroglu, Y., Gambardella, G., Blott, M., Leong, P., Jahre, M., and Vissers, K., 2017. "Scaling Binarized Neural Networks on Reconfigurable Logic." In Proceedings of the 8th Workshop and 6th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms - PARMA-DITAM '17, ACM Press, pp. 25–30. 54, 61, 64, 65, 67

[68] Umuroglu, Y., Fraser, N. J., Gambardella, G., Blott, M., Leong, P., Jahre, M., and Vissers, K., 2017. "FINN." In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '17, ACM Press, pp. 65–74. 56, 59, 60, 61, 63, 64, 65

[69] Song, D., Yin, S., Ouyang, P., Liu, L., and Wei, S., 2018. "Low Bits: Binary Neural Network for Vad and Wakeup." In 2018 5th International Conference on Information Science and Control Engineering (ICISCE), IEEE, pp. 306–311. 56

[70] Yin, S., Ouyang, P., Zheng, S., Song, D., Li, X., Liu, L., and Wei, S., 2018. "A 141 UW, 2.46 PJ/Neuron Binarized Convolutional Neural Network Based Self-Learning Speech Recognition Processor in 28NM CMOS." In 2018 IEEE Symposium on VLSI Circuits, Vol. 2018-June, IEEE, pp. 139–140. 56, 64

[71] Li, Y., Liu, Z., Liu, W., Jiang, Y., Wang, Y., Goh, W. L., Yu, H., and Ren, F., 2018. "A 34-FPS 698-GOP/s/W Binarized Deep Neural Network-based Natural Scene Text Interpretation Accelerator for Mobile Edge Computing." IEEE Transactions on Industrial Electronics, PP(c), pp. 1–1. 56, 64

[72] Bulat, A., and Tzimiropoulos, G., 2017. "Binarized Convolutional Landmark Localizers for Human Pose Estimation and Face Alignment with Limited Resources." In 2017 IEEE International Conference on Computer Vision (ICCV), Vol. 2017-Octob, IEEE, pp. 3726–3734. 56

[73] Ma, C., Guo, Y., Lei, Y., and An, W., 2019. "Binary Volumetric Convolutional Neural Networks for 3-D Object Recognition." IEEE Transactions on Instrumentation and Measurement, 68(1), jan, pp. 38–48. 56

[74] Eccv, A., 2018. "Efficient Super Resolution Using Binarized Neural Network." 56

[75] Bulat, A., and Tzimiropoulos, Y., 2018. "Hierarchical binary CNNs for landmark localization with limited resources." IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(8), pp. 1–1. 56

[76] Say, B., and Sanner, S., 2018. "Planning in factored state and action spaces with learned binarized neural network transition models." IJCAI International Joint Conference on Artificial Intelligence, 2018-July, pp. 4815–4821. 56


[77] Chi, C.-C., and Jiang, J.-H. R., 2018. "Logic synthesis of binarized neural networks for efficient circuit implementation." In Proceedings of the International Conference on Computer-Aided Design - ICCAD '18, ACM Press, pp. 1–7. 59, 61

[78] Narodytska, N., Ryzhyk, L., and Walsh, T., 2018. "Verifying Properties of Binarized Deep Neural Networks." pp. 6615–6624. 59

[79] Yang, H., Fritzsche, M., Bartz, C., and Meinel, C., 2017. "BMXNet: An Open-Source Binary Neural Network Implementation Based on MXNet." Workshop: Proceedings of New Security Paradigms, may. 59, 61

[80] Blott, M., Preußer, T. B., Fraser, N. J., Gambardella, G., O'brien, K., Umuroglu, Y., Leeser, M., and Vissers, K., 2018. "FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks." ACM Transactions on Reconfigurable Technology and Systems, 11(3), dec, pp. 1–23. 59, 60, 61, 63, 64, 65, 75

[81] McDanel, B., Teerapittayanon, S., and Kung, H. T., 2017. "Embedded Binarized Neural Networks." pp. 168–173. 59, 61

[82] Jokic, P., Emery, S., and Benini, L., 2018. "BinaryEye: A 20 kfps Streaming Camera System on FPGA with Real-Time On-Device Image Recognition Using Binary Neural Networks." In 2018 IEEE 13th International Symposium on Industrial Embedded Systems (SIES), IEEE, pp. 1–7. 59, 64, 65

[83] Valavi, H., Ramadge, P. J., Nestler, E., and Verma, N., 2018. "A Mixed-Signal Binarized Convolutional-Neural-Network Accelerator Integrating Dense Weight Storage and Multiplication for Reduced Data Movement." In 2018 IEEE Symposium on VLSI Circuits, Vol. 2018-June, IEEE, pp. 141–142. 59, 61, 64

[84] Kim, M., and Smaragdis, P., 2016. "Bitwise Neural Networks." 59, 60

[85] Sun, X., Yin, S., Peng, X., Liu, R., Seo, J.-s., and Yu, S., 2018. "XNOR-RRAM: A scalable and parallel resistive synaptic architecture for binary neural networks." In 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), Vol. 2018-Janua, IEEE, pp. 1423–1428. 59, 61, 66

[86] Yu, S., Li, Z., Chen, P.-Y., Wu, H., Gao, B., Wang, D., Wu, W., and Qian, H., 2016. "Binary neural network with 16 Mb RRAM macro chip for classification and online training." In 2016 IEEE International Electron Devices Meeting (IEDM), IEEE, pp. 16.2.1–16.2.4. 60, 66

[87] Zhou, Y., Redkar, S., and Huang, X., 2017. "Deep learning binary neural network on an FPGA." In 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS), Vol. 2017-Augus, IEEE, pp. 281–284. 61, 63, 65

[88] Nakahara, H., Fujii, T., and Sato, S., 2017. "A fully connected layer elimination for a binarizec convolutional neural network on an FPGA." In 2017 27th International Conference on Field Programmable Logic and Applications (FPL), IEEE, pp. 1–4. 61, 63, 65


[89] Yang, L., He, Z., and Fan, D., 2018. "A Fully Onchip Binarized Convolutional Neural Network FPGA Impelmentation with Accurate Inference." Proceedings of the International Symposium on Low Power Electronics and Design, pp. 50:1–50:6. 61, 63, 64, 65

[90] Bankman, D., Yang, L., Moons, B., Verhelst, M., and Murmann, B., 2019. "An Always-On 3.8 micro J/86% CIFAR-10 Mixed-Signal Binary CNN Processor With All Memory on Chip in 28-nm CMOS." IEEE Journal of Solid-State Circuits, 54(1), jan, pp. 158–172. 61, 64

[91] Rusci, M., Rossi, D., Flamand, E., Gottardi, M., Farella, E., and Benini, L., 2018. "Always-ON visual node with a hardware-software event-based binarized neural network inference engine." In Proceedings of the 15th ACM International Conference on Computing Frontiers - CF '18, no. 1, ACM Press, pp. 314–319. 61

[92] Ding, R., Liu, Z., Blanton, R. D. S., and Marculescu, D., 2018. "Quantized Deep Neural Networks for Energy Efficient Hardware-based Inference." pp. 1–8. 62

[93] Ling, Y., Zhong, K., Wu, Y., Liu, D., Ren, J., Liu, R., Duan, M., Liu, W., and Liang, L., 2018. "TaiJiNet: Towards Partial Binarized Convolutional Neural Network for Embedded Systems." In 2018 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Vol. 2018-July, IEEE, pp. 136–141. 62

[94] Yonekawa, H., and Nakahara, H., 2017. "On-Chip Memory Based Binarized Convolutional Deep Neural Network Applying Batch Normalization Free Technique on an FPGA." In 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, pp. 98–105. 62, 63, 64, 65, 66, 75

[95] Rybalkin, V., Pappalardo, A., Ghaffar, M. M., Gambardella, G., Wehn, N., and Blott, M., 2018. "FINN-L: Library Extensions and Design Trade-Off Analysis for Variable Precision LSTM Networks on FPGAs." In 2018 28th International Conference on Field Programmable Logic and Applications (FPL), IEEE, pp. 89–897. 63, 64

[96] Nakahara, H., Yonekawa, H., Sasao, T., Iwamoto, H., and Motomura, M., 2016. "A memory-based realization of a binarized deep convolutional neural network." In 2016 International Conference on Field-Programmable Technology (FPT), IEEE, pp. 277–280. 63, 65

[97] Nakahara, H., Yonekawa, H., Fujii, T., and Sato, S., 2018. "A Lightweight YOLOv2." In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '18, ACM Press, pp. 31–40. 63

[98] Faraone, J., Fraser, N., Blott, M., and Leong, P. H. W., 2018. "SYQ: Learning Symmetric Quantization For Efficient Deep Neural Networks." 64

[99] Nurvitadhi, E., Sheffield, D., Sim, J., Mishra, A., Venkatesh, G., and Marr, D., 2016. "Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC." In 2016 International Conference on Field-Programmable Technology (FPT), IEEE, pp. 77–84. 64, 66


[100] Jafari, A., Hosseini, M., Kulkarni, A., Patel, C., and Mohsenin, T., 2018. "BiNMAC." In Proceedings of the 2018 Great Lakes Symposium on VLSI - GLSVLSI '18, ACM Press, pp. 443–446. 64

[101] Bahou, A. A., Karunaratne, G., Andri, R., Cavigelli, L., and Benini, L., 2018. "XNORBIN: A 95 TOp/s/W hardware accelerator for binary convolutional neural networks." In 2018 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS), IEEE, pp. 1–3. 64

[102] Rusci, M., Cavigelli, L., and Benini, L., 2018. "Design Automation for Binarized Neural Networks: A Quantum Leap Opportunity?" In 2018 IEEE International Symposium on Circuits and Systems (ISCAS), Vol. 732631, IEEE, pp. 1–5. 66

[103] Sun, X., Liu, R., Peng, X., and Yu, S., 2018. "Computing-in-Memory with SRAM and RRAM for Binary Neural Networks." In 2018 14th IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT), IEEE, pp. 1–4. 66

[104] Choi, W., Jeong, K., Choi, K., Lee, K., and Park, J., 2018. "Content addressable memory based binarized neural network accelerator using time-domain signal processing." In Proceedings of the 55th Annual Design Automation Conference - DAC '18, ACM Press, pp. 1–6. 66

[105] Angizi, S., and Fan, D., 2017. "IMC: Energy-Efficient In-Memory Concvolver for Accelerating Binarized Deep Neural Networks." In Proceedings of the Neuromorphic Computing Symposium - NCS '17, no. 1, ACM Press, pp. 1–8. 66

[106] Liu, R., Peng, X., Sun, X., Khwa, W.-S., Si, X., Chen, J.-J., Li, J.-F., Chang, M.-F., and Yu, S., 2018. "Parallelizing SRAM Arrays with Customized Bit-Cell for Binary Neural Networks." In 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), IEEE, pp. 1–6. 66

[107] Zhou, Z., Huang, P., Xiang, Y. C., Shen, W. S., Zhao, Y. D., Feng, Y. L., Gao, B., Wu, H. Q., Qian, H., Liu, L. F., Zhang, X., Liu, X. Y., and Kang, J. F., 2018. "A new hardware implementation approach of BNNs based on nonlinear 2T2R synaptic cell." In 2018 IEEE International Electron Devices Meeting (IEDM), IEEE, pp. 20.7.1–20.7.4. 66

[108] Tang, T., Xia, L., Li, B., Wang, Y., and Yang, H., 2017. "Binary convolutional neural network on rram." In 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 782–787. 66

[109] Yang, L., He, Z., and Fan, D., 2019. "Binarized depthwise separable neural network for object tracking in fpga." In Proceedings of the 2019 Great Lakes Symposium on VLSI, GLSVLSI '19, Association for Computing Machinery, pp. 347–350. 69

[110] He, K., Zhang, X., Ren, S., and Sun, J., 2016. "Deep residual learning for image recognition." In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. 71

[111] Simons, T., and Lee, D.-J., 2019. "Jet features: Hardware-friendly, learned convolutional kernels for high-speed image classification." Electronics, 8(5). 75, 81


[112] Simons, T., and Lee, D.-J., 2019. “A review of binarized neural networks.” Electronics,8(6). 81, 85
