


HOBFLOPS CNNS: HARDWARE OPTIMIZED BITSLICE-PARALLEL FLOATING-POINT OPERATIONS FOR

CONVOLUTIONAL NEURAL NETWORKS

A PREPRINT

James Garland∗

The School of Computer Science and Statistics
Trinity College Dublin, The University of Dublin
College Green, Dublin 2, Ireland
[email protected]

David Gregg
The School of Computer Science and Statistics
Trinity College Dublin, The University of Dublin
College Green, Dublin 2, Ireland
[email protected]

November 18, 2021

ABSTRACT

Convolutional neural networks (CNNs) are typically trained using 16- or 32-bit floating-point (FP), and researchers show that low-precision FP can be highly effective for inference. Low-precision FP can be implemented in field programmable gate array (FPGA) and application-specific integrated circuit (ASIC) accelerators, but existing processors do not generally support custom-precision FP. We propose hardware optimized bitslice-parallel floating-point operators (HOBFLOPS), a method of generating efficient custom-precision emulated bitslice-parallel software FP arithmetic. We generate custom-precision FP routines optimized using a hardware synthesis design flow to create circuits. We provide standard cell libraries matching the bitwise operations on the target microprocessor architecture, and a code generator to translate the hardware circuits to bitslice software equivalents. We exploit bitslice parallelism to create a very wide (32–512 element) vectorized convolutional neural network (CNN) convolution. HOBFLOPS multiply-accumulate (MAC) performance in CNN convolution on Arm and Intel processors is compared to Berkeley's SoftFP16 equivalent MAC. HOBFLOPS16 outperforms SoftFP16 by 8× on Intel AVX512. HOBFLOPS offers arbitrary-precision FP with custom range and precision, e.g., HOBFLOPS9 performs at 6× the performance of HOBFLOPS16 on Arm Neon. HOBFLOPS allows researchers to prototype different levels of custom FP precision in the arithmetic of software CNN accelerators. Furthermore, HOBFLOPS fast custom-precision FP CNNs may be valuable in cases where memory bandwidth is limited.

Keywords Bitslice parallel arithmetic, datapath circuits, hardware accelerators, reduced floating-point precision arithmetic, convolutional neural networks.

1 Introduction

Many researchers have shown that CNN inference is possible with low-precision integer [1] and floating-point (FP) [2, 3] arithmetic. Almost all processors provide excellent support for 8-bit integer values, but not for bit-level custom-precision FP types, such as 9-bit FP. Typically, processors support a small number of relatively high-precision FP types, such as 32- and 64-bit. However, there are good reasons why we might want to implement custom-precision FP on regular processors. Researchers and hardware developers may want to prototype different levels of custom FP precision

∗This research is supported by Science Foundation Ireland, Project 12/IA/1381. We thank the Institute of Technology Carlow, Carlow, Ireland for their support.

arXiv:2007.06563v2 [cs.AR] 18 Feb 2021


that might be used for arithmetic in CNN accelerators. Furthermore, fast custom-precision FP CNNs in software may be valuable in their own right, particularly in cases where memory bandwidth is limited.

While the choice of custom FP in general purpose central processing units (CPUs) is limited, FP simulators, such as Berkeley's excellent SoftFP [4], are available. These simulators support certain ranges of custom FP, such as 16-, 32-, 64-, 80- and 128-bit, with corresponding fixed-width mantissas and exponents.

We propose hardware optimized bitslice-parallel floating-point operators (HOBFLOPS), which offer arbitrary-precision FP arithmetic, using software bitslice parallel [5] arithmetic, emulating any required precision of FP arithmetic at any bit-width of mantissa or exponent. Our goal is to use bitslice packing to pack the vector registers of the microprocessor efficiently. Also, we exploit the bitwise logic optimization strategies of a commercial hardware synthesis tool to optimize the associated bitwise arithmetic that forms a multiplier and adder. We generate efficient arbitrary-precision software FP emulation types and arithmetic, optimized using hardware tools and converted to the target processor's bitwise logic operators.

Existing methods of emulating FP arithmetic in software primarily use existing integer instructions to implement the steps of FP computation. This can work well for large, regular-sized FP types such as FP16, FP32, or FP64. Berkeley offers SoftFP emulation [4] for use where, for example, only integer precision instructions are available. SoftFP emulation supports 16- to 128-bit arithmetic and does not support low bit-width custom-precision FP arithmetic or parallel arithmetic. HOBFLOPS offers fine-grained customizable-precision mantissa and exponent FP bitwise arithmetic computed in parallel. To evaluate performance we benchmark HOBFLOPS16 parallel MACs against Berkeley's SoftFP16 MulAdd, in CNN convolution implemented with Arm and Intel scalar and vector bitwise instructions. We show HOBFLOPS offers significant performance boosts compared to SoftFP. We then evaluate HOBFLOPS8–HOBFLOPS16e parallel arbitrary-precision performance. We argue that our software bitslice parallel FP is both more efficient and offers greater bit-level customizability than conventional software FP emulation.

We make the following contributions:

• We present a full design flow from a VHDL FP core generator to arbitrary-precision software bitslice parallel FP operators, optimized using hardware design tools with our logic cell libraries, and our domain-specific code generator.

• We demonstrate how 3-input Arm Neon bitwise instructions, e.g., SEL (multiplexer), and AVX512 bitwise ternary operations can be used in standard cell libraries to improve the efficiency of the generated code significantly.

• We present an algorithm for implementing CNN convolution with the very wide vectors that arise in bitslice parallel vector arithmetic.

• We evaluate HOBFLOPS16 on Arm Neon and Intel AVX2 and AVX512 processors and find our approach achieves approximately 0.5×, 2.5×, and 8× the performance of Berkeley SoftFP16 respectively.

• We evaluate various widths of HOBFLOPS from HOBFLOPS8–HOBFLOPS16e and find, e.g., HOBFLOPS9 performs at approximately 45 million MACs/second on the Arm Neon processor, around 6× that of HOBFLOPS16, and 2 billion MACs/second on an Intel AVX512 platform, around 5× that of HOBFLOPS16. The increased performance is due to:

– Bitslice parallelism of the very wide vectorization of the MACs of the CNN;
– Our efficient code generation flow.

The rest of this article is organized as follows. Section 2 gives background on other CNN accelerators' use of low-precision arithmetic types. Section 3 outlines bitslice parallel operations, introduces HOBFLOPS, shows the design flow and the types supported, and explains how to implement arbitrary-precision HOBFLOPS FP arithmetic in a convolution layer of a CNN. Section 4 shows results of our comparisons of HOBFLOPS16 to Berkeley SoftFP16 on Intel's AVX2 and AVX512 processors and shows significant increases in performance. We also show results for HOBFLOPS8–HOBFLOPS16e emulation implemented on Arm Neon, Intel AVX2 and AVX512 processors. We outline related work in Section 5 and conclude with Section 6.

2 Background

Reduced-precision CNN inference, particularly reduced-precision weight data, reduces the computational requirements due to memory accesses, which dominate energy consumption. Energy and area costs are also reduced in ASICs and FPGAs [6].


Kang et al. [7] investigate short, reduced FP representations that do not support not-a-numbers (NANs) and infinities. They show that shortening the width of the exponent and mantissa reduces the computational complexity within the multiplier logic. They compare fixed-point integer representations with varying widths up to 8 bits of their short FP in various CNNs, and show around a 1% drop in classification accuracy, with more than 60% reduction in ASIC implementation area. Their work stops at the byte boundary, leaving other arbitrary ranges open to investigation.

Researchers often use custom precision in the design of arithmetic accelerators implemented in hardware such as FPGAs or ASICs. For their Project Brainwave [2, 3], Microsoft proposes MS-FP8 and MS-FP9, 8-bit and 9-bit FP arithmetic that they exploit in a quantized CNN [2, 3]. Microsoft slightly alters the 8-bit Minifloat that follows the IEEE-754 specification (1 sign bit, 4 exponent bits, 3 mantissa bits) [8] for more efficient implementation in an FPGA, creating MS-FP8, with 1 sign bit, 5 exponent bits, and 2 mantissa bits. MS-FP8 gives a larger representative range due to the extra exponent bit but lower accuracy than Minifloat, caused by the reduced mantissa. However, this more than doubles the performance on an Intel Altera Stratix V D5 FPGA running at 225MHz to 4.5 tera floating-point operations per second (TFLOPS), compared to 2.0 TFLOPS for 8-bit integer operations. This performance increases to more than 90 TFLOPS on the 14nm Stratix 10 280 FPGA running at 500MHz. To improve the precision, they propose MS-FP9, which increases the mantissa to 3 bits and keeps the exponent at 5 bits. The MS-FP8 / MS-FP9 bit-widths are suitable for FPGA implementation and application to neural network acceleration in Microsoft data centers. Their later work [3] uses a shared exponent with their proposed MS-FP8 / MS-FP9, i.e., one exponent pair used for many mantissae, sharing the reduced mantissa multipliers, something this work does not investigate. Their work remains at 8 and 9 bits for FPGA implementation and leaves other bit-precision and range investigation on microprocessor architectures open.

Other researchers investigate optimizing different representations of FP arithmetic. Xu et al. [5] propose bitslice parallel arithmetic and present FP calculations undertaken on a fixed-point unit. Instead of storing 17-bit vectors inefficiently in 32-bit registers in the traditional sense, they store thirty-two 17-bit words transformed into bitslice parallel format in memory. Xu et al. manually construct bitwise arithmetic routines to perform integer or FP arithmetic while the vectors remain in a bitslice parallel format. When coded in C/C++ and AVX2 single instruction multiple data (SIMD) instructions, they demonstrate this approach is efficient for low-precision vectors, such as 9-bit or 11-bit arbitrary FP types. Xu et al.'s work manually optimizes the arithmetic, leaving automating the process an interesting area for investigation.

There are byte-precision arithmetic accelerators in software. Anderson et al. [9] exploit the SIMD nature of the processor by proposing flytes, a scheme for byte-level customizable-precision FP arithmetic in the SIMD vector registers. They use SIMD instructions to convert between the custom format and native FP while packing and unpacking the data into the vector registers. Their work is at the byte-level boundary in the packing and does not support arbitrary bit-level precision.

3 Approach

In this section, we present our approach to producing HOBFLOPS arithmetic units, which we demonstrate in the convolution layer of a CNN. HOBFLOPS is a method for generating efficient software-emulated parallel FP arithmetic units optimized using hardware synthesis tools. Figure 1 outlines our flow for creating HOBFLOPS arithmetic units. We generate the register transfer logic (RTL) representations of arbitrary-precision FP multipliers and adders using the FP unit generator, FLOating-POint COres (FloPoCo) [10]. The floating-point accuracy of storage and computation of the HOBFLOPS adders and multipliers is configured by choosing the required FP exponent and mantissa values of the VHDL produced by FloPoCo.

We produce standard cell libraries of logic gate cells supported by the bitwise logic instructions of the target microprocessor architecture. For example, the AND, OR, XOR, and SEL (multiplexer) bitwise instructions are supported on Arm Neon, and the ternary logic look-up table (LUT) bitwise instructions are supported on Intel AVX512.

We use the ASIC synthesizer tool, Cadence Genus [11], in conjunction with our standard cell libraries and a small-gate-area optimization and automation script to synthesize the adders and multipliers into Verilog netlists. The open-source synthesis and technology mapping suite, the Yosys ASIC synthesizer [12] and ABC optimizer [13], allows us to optimize further and topologically sort the netlists with the same cell libraries.

Our custom domain-specific source-to-source generator converts the topologically sorted netlists into a C/C++ header of bitwise operations. In parallel, we convert the hardware cell libraries into the equivalent C/C++ cell library headers of the target processor instruction set architecture (ISA). We create a CNN convolution layer to include the HOBFLOPS adder and multiplier header and the cell library header corresponding to the target processor ISA, and compile with G++.


[Figure: tool flow from the FP core generator (FloPoCo), which produces adder and multiplier VHDL and test benches from an FP adder/multiplier configuration (exponent/mantissa bit-widths), through the synthesizer/optimizer (Cadence Genus) with ISA-specific Liberty cell libraries and a small-gate-area optimization configuration to low gate-count optimized Verilog MAC netlists, the optimizer/sorter (ABC/Yosys) producing optimized Verilog netlists, and the domain-specific source-to-source generators producing ISA-specific header cell libraries and C++ sources, compiled with G++ (the C compiler) into ISA-specific C header files and object files.]

Figure 1: Flow for Creating HOBFLOPS Bitwise Operations. (Yellow signifies third-party tools.)


Table 1: Cell Libraries' Support for Bitwise Logic Operations.

Arm (64-bit):                AND (A & B), OR (A | B), XOR (A ^ B), NOT (~A), ORN (A & (~B))
Arm Neon (128-bit) [14]:     AND (A & B), OR (A | B), XOR (A ^ B), NOT (~A), ORN (A | ~B), SEL (~((S & A) | (~S & B)))
Intel (64-bit):              AND (A & B), OR (A | B), XOR (A ^ B), NOT (~A)
Intel AVX2 (128-, 256-bit):  AND (A & B), OR (A | B), XOR (A ^ B), NOT (~A), ANDNOT (~A & B)
Intel AVX512 (512-bit) [15]: LUT000 (0), LUT001 ((A | (B | C)) ^ 1), LUT002 ((~(B | A)) & C), LUT003 ((B | A) ^ 1), LUT004 ((~(A | C)) & B), LUT005 ((C | A) ^ 1), ..., LUT253 (A | (B | (C ^ 1))), LUT254 (A | (B | C)), LUT255 (1) — a truncated subset of the 256 ternary LUT functions

[Figure: full adder circuits with inputs CIn, X, Y and outputs Sum, Cout.]

(a) Full Adder Implemented in Intel AVX2 Intrinsic Bitwise Operations. (b) Full Adder Implemented in Arm Neon Intrinsic Bitwise Operations. (c) Full Adder Implemented in Intel AVX512 3-Input LUT Intrinsic Bitwise Operations (LUT150, LUT029).

Figure 2: Full Adders (Implemented in Intel's AVX2 and AVX512 and Arm's Neon Intrinsic Bitwise Operations.)

3.1 Arm Neon, Intel AVX2 and AVX512 Cell Libraries

We create three Synopsys Liberty [16] standard cell libraries supporting the equivalent hardware gate-level representations of the Arm Neon [14], Intel x86_64, AVX, AVX2 and AVX512 SIMD vector intrinsics [15]. The Arm Neon SEL (multiplexer) instruction is a 3-input bitwise logic instruction, whereas all other Neon bitwise logic instructions modeled in the cell library are 2-input. The ternary logic LUT of the Intel AVX512 is a 3-input bitwise logic instruction that can implement any 3-input boolean function. An 8-bit immediate operand to this instruction specifies which of the 256 3-input functions should be used. We create all 256 equivalent cells in the Liberty cell library when targeting AVX512 devices. Table 1 lists the bitwise operations supported in the cell libraries for each architecture. The AVX512 column shows a truncated example subset of the available 256 bitwise logic functions; see Intel's Intrinsics Guide [15] and Software Developer's Manual [17] for the complete set.

To demonstrate the capabilities of the cell libraries, we show an example of a single-bit full adder. Figure 2a shows a typical 5-gate full adder implemented with our AVX2 cell library. The C code function of Listing 1 demonstrates these gates in code, if WIDTH is set to 1. The same full adder can be implemented in three Arm Neon bitwise logic instructions, Figure 2b, one of which is the SEL bitwise multiplexer instruction. Intel AVX512 intrinsics can implement the full adder in two 3-input bitwise ternary instructions, Figure 2c.

#define WIDTH 2
void adder(u32 *x, u32 *y, u32 *sum, u32 *cout) {
    u32 carry = 0;
    for (int i = 0; i < WIDTH; i++) {
        sum[i] = x[i] ^ y[i] ^ carry;                     // per-bit sum across all lanes
        carry = (carry & (x[i] ^ y[i])) | (x[i] & y[i]);  // per-bit carry across all lanes
    }
    *cout = carry;
}

Listing 1: Bitslice Parallel Adder for Two WIDTH-bit Wide Arrays of Unsigned Integers.


Figure 3: Many 9-bit FP Values Transformed to 9-bit bitslice parallel FP Data Layout.

Figure 4: Example 512 9-bit Bitslice Parallel FP Add Operation Using AVX512 Registers.


Table 2: Comparison of Existing Custom FP.

Sign   Exponent   Mantissa   Type & Availability
1      4          3          IEEE-FP8 in Software [8]
1      5          2          MS-FP8 in FPGA [2, 3]
1      5          3          MS-FP9 in FPGA [2, 3]

Table 3: HOBFLOPS MAC Types, Standard and Extended (e) Precision

HOBFLOPS(IEEE)XX /      Inputs Bit Width      Outputs Bit Width
HOBFLOPS(IEEE)XXe       Expo      Mant        Expo      Mant
Soft FP16 [4]           5         10          5         10
HOBFLOPSIEEE8           4         3           4         4
HOBFLOPSIEEE8e          4         3           4         7
HOBFLOPS8               5         2           5         3
HOBFLOPS8e              5         2           5         5
HOBFLOPS9               5         3           5         4
HOBFLOPS9e              5         3           5         7
HOBFLOPS10              5         4           5         5
HOBFLOPS10e             5         4           5         9
HOBFLOPS11              5         5           5         6
HOBFLOPS11e             5         5           5         11
HOBFLOPS12              5         6           5         7
HOBFLOPS12e             5         6           5         13
HOBFLOPS13              5         7           5         8
HOBFLOPS13e             5         7           5         15
HOBFLOPS14              5         8           5         9
HOBFLOPS14e             5         8           5         17
HOBFLOPS15              5         9           5         10
HOBFLOPS15e             5         9           5         19
HOBFLOPS16              5         10          5         11
HOBFLOPS16e             5         10          5         21

While the inputs and outputs of the hardware gate-level circuits are single bits, these inputs and outputs are parallelized by the bit-width of the SIMD vector registers. The bitwise instructions, when converted to bitwise software operations, produce extensive parallel scaling of the arithmetic.
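For illustration, the following is a minimal sketch of how such cells could be expressed with Arm Neon intrinsics; the cell names are illustrative assumptions (in the style of the AVX2 cell library shown later in Listing 2), not the generated library itself.

#include <arm_neon.h>

typedef uint32x4_t u128; /* one 128-bit Neon register treated as 128 bitslice lanes */

/* Illustrative 2-input cells (assumed names) */
static inline void AND2X1(const u128 *A, const u128 *B, u128 *Y) { *Y = vandq_u32(*A, *B); }
static inline void OR2X1 (const u128 *A, const u128 *B, u128 *Y) { *Y = vorrq_u32(*A, *B); }
static inline void XOR2X1(const u128 *A, const u128 *B, u128 *Y) { *Y = veorq_u32(*A, *B); }
static inline void NOTX1 (const u128 *A, u128 *Y)                { *Y = vmvnq_u32(*A); }

/* 3-input SEL cell: a per-bit multiplexer; one plausible mapping is the Neon
   bitwise-select intrinsic, vbslq_u32(S, A, B) = (S & A) | (~S & B) */
static inline void SELX1(const u128 *S, const u128 *A, const u128 *B, u128 *Y) {
    *Y = vbslq_u32(*S, *A, *B);
}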

3.2 Bitslice Parallel Operations

HOBFLOPS exploits bitslice parallel operations to represent FP numbers in a bitwise manner so that they can be processed in parallel. For example, many 9-bit values are transformed to bitslice parallel representations, see Figure 3. A simple example of how nine registers of 512-bit bitslice parallel data are applied to a 512-bit wide bitslice parallel FP adder is shown in Figure 4. Each instruction within the adder has a throughput of around half a clock cycle (see Intel's Intrinsics Guide for details of the precise throughput of each of the SSE, AVX2 and AVX512 bitwise logic operations [15] and Arm's Intrinsics Reference [14] for details of Arm Neon bitwise operation throughput). In this example, the adder's propagation delay is related to the amount of instruction-level parallelism and the associated load/store instructions. The number of gates in the HOBFLOPS adder or multiplier depends on the required HOBFLOPS precision; see Table 3 for examples of HOBFLOPS MAC precision, and Section 4 for the associated HOBFLOPS MAC gate counts and performance.
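As an illustration of the data layout in Figure 3, the following is a minimal sketch (not the paper's code) of transforming a batch of values into bitslice-parallel form using 64-bit words; the same idea extends to 128–512-bit SIMD registers.

#include <stdint.h>

/* Transform 64 values, each nbits wide (nbits <= 16 here), into nbits bitslice words:
   slices[b] holds bit b of every value, one value per lane. */
void to_bitslice(const uint16_t vals[64], uint64_t *slices, int nbits) {
    for (int b = 0; b < nbits; b++) {
        uint64_t s = 0;
        for (int i = 0; i < 64; i++)
            s |= (uint64_t)((vals[i] >> b) & 1u) << i;  /* lane i <- bit b of value i */
        slices[b] = s;
    }
}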

3.3 Design Flow

We investigate whether bitslice parallel logic can be optimized using hardware tools to reduce the hardware logic gate count or area, and the subsequent lines of bitwise software operations. We use the industry-standard hardware ASIC synthesizer, Cadence Genus, with our custom-designed logic cell libraries to synthesize and optimize the bitslice parallel FP arithmetic. We use the Yosys ASIC synthesizer [12] and ABC optimizer [13] to further optimize and topologically sort the netlist.

Taking inspiration from Microsoft’s MS-FP8 / MS-FP9 [2, 3] and Minifloat 8-bit that follows the IEEE-754 specification[8], we create single-precision and extended-precision HOBFLOPS adders and multipliers. For example, we create


Figure 5: HOBFLOPS8 Multiplier Netlist with Arm Neon Cells - Note the use of the SEL (multiplexer).

the single-precision HOBFLOPS8 multiplier to take two 5-bit exponent, 2-bit mantissa, 1-bit sign inputs and produce a single 5-bit exponent, 3-bit mantissa and 1-bit sign output. We also create the extended-precision HOBFLOPS8e multiplier to take two 5-bit exponent, 2-bit mantissa, and 1-bit sign inputs and produce a single 5-bit exponent, 5-bit extended mantissa and 1-bit sign output.

When using HOBFLOPS, any quantization method may be employed, such as those investigated by Gong et al. [18]. These quantized values need to be stored and computed during inference mode. Therefore we set FloPoCo to the required range and precision. For comparison, Table 2 shows the range and precision of IEEE-FP8, MS-FP8 and MS-FP9, available in simulated software and FPGA hardware respectively. These FP solutions only support specific ranges and do not allow for arbitrary exponent and mantissa values.

Details of the evaluated HOBFLOPS are in Table 3, although other arbitrary combinations of mantissa and exponent bit-widths are supported in the flow (see Figure 1).

FloPoCo [10], the FP unit generator, generates VHDL [19] RTL descriptions of FP adders and multipliers of varying exponent and mantissa bit-widths for standard and extended precision HOBFLOPS types, see Table 3. As a performance baseline, we produce the 16- and 16e-bit IEEE-754 FP versions of the multiplier and adder, for comparison against Berkeley's SoftFP16. FloPoCo automatically produces the corresponding VHDL test benches and test vectors required to test the generated cores. FloPoCo has a slightly different way of encoding the FP numbers when compared to the IEEE-754 2019 specification and does not support subnormal numbers. A FloPoCo FP number [10] is a bit-vector consisting of the following 4 fields: a 2-bit exception field (01 for normal numbers); a sign bit; an exponent field wE bits wide; and a mantissa (fractional) field wF bits wide. The significand has an implicit leading 1, so the fraction field ff...ff represents the significand 1.ff...ff.
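For illustration, a word with the field order described above could be assembled from pre-quantized fields as in the following sketch (the function and field names are ours, not FloPoCo's API):

#include <stdint.h>

/* Pack a FloPoCo-style word: | exc (2) | sign (1) | exponent (wE) | fraction (wF) |.
   exc = 01 for normal numbers; the leading 1 of the significand is implicit. */
uint32_t flopoco_pack(uint32_t exc, uint32_t sign, uint32_t expo, uint32_t frac,
                      unsigned wE, unsigned wF) {
    return (exc  << (1u + wE + wF)) |
           (sign << (wE + wF))      |
           (expo << wF)             |
            frac;
}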

We configure FloPoCo to generate combinatorial plain RTL VHDL cores with a 1MHz frequency, no pipelining, and no use of hard-macro FPGA multipliers or adders. These settings ensure that FloPoCo generates reduced-area rather than reduced-latency multipliers and adders. We simulate the FP multipliers and adders in a VHDL simulator with the corresponding FloPoCo-generated test bench to confirm that the quantized functionality is equivalent to an IEEE-754 FP multiplier and adder.

We create Synopsys Liberty [16] standard cell libraries to support the target processor architecture. Cadence Genus (version 16.22-s033_1) [11], the industry-standard ASIC synthesis tool, synthesizes the adder and multiplier VHDL cores with our standard cell libraries, configuration, and small-gate-area optimisation script into a Verilog [20] netlist of the logic gates. See Figure 5 for an example of the HOBFLOPS8 multiplier logic produced by FloPoCo when synthesized with our Arm Neon cell library. Note how Genus has synthesized the design to include the 3-input SEL gate (multiplexer) supported by Arm Neon.


[Figure: the IFM (IFM_H × IFM_W × C), kernel (K × K × C × M) and OFM (OFM_H × OFM_W × M) values are held as HOBFLOPS bitslice words of NINBITS (inputs) and NOUTBITS (outputs) across LANES, where a SIMD register is 64–512 bits wide; the IFM HOBFLOPS values are broadcast across the kernel HOBFLOPS value tiles, multiplied (HOBFLOPS Mult), accumulated (HOBFLOPS Add), passed through HOBFLOPS ReLU, and stored as HOBFLOPS values in the OFM.]

Figure 6: HOBFLOPS CNN Convolution (IFM and Kernel data pre-transformed to HOBFLOPS; the OFM remains in HOBFLOPS layout.)

void AND2X1(u256 *A, u256 *B, u256 *Y) {
    *Y = _mm256_and_si256(*A, *B); }
void NOTX1(u256 *A, u256 *Y) {
    // Inverter could be implemented in many ways
    *Y = _mm256_xor_si256(*A, _mm256_set1_epi32(-1)); }
void OR2X1(u256 *A, u256 *B, u256 *Y) {
    *Y = _mm256_or_si256(*A, *B); }
void XOR2X1(u256 *A, u256 *B, u256 *Y) {
    *Y = _mm256_xor_si256(*A, *B); }
void ANDNOT2X1(u256 *A, u256 *B, u256 *Y) {
    *Y = _mm256_andnot_si256(*A, *B); }

Listing 2: Macros for Wide AVX2 Cell Library Bitwise Operator Definitions of Hardware Logic Gates

All HOBFLOPS designs are combinatorial, so synthesis timing constraints are not required. In the Liberty file of the standard cell libraries, we assign a value of 1.0 to the cell area and cell leakage power of all the cells. We configure all cell capacitance and timing values to zero. These values ensure that the synthesizer assigns equal optimization priority to all gates and produces a netlist with the least number of logic gates, rather than creating a netlist optimized for hardware timing propagation.

We further optimize the netlist using the open-source Yosys ASIC synthesizer [12] and ABC optimizer [13]. We use ABC's strash command to transform the current network into an AND-inverter graph (AIG) by one-level structural hashing. We then use the refactor function to iteratively collapse and refactor the levels of logic and the area of the netlist. We configure Yosys to produce a topologically sorted Verilog netlist of gates. The topological sorting is required because Cadence Genus writes the netlist file in output-port to input-port order, whereas the C/C++ compiler requires the converted netlist to have input-to-output ordering. We formally verify the topologically sorted netlist against the original netlist with the Yosys satisfiability (SAT) solver.


ALGORITHM 1: HOBFLOPS Code for a Simple AVX2 2-bit Binary Full Adder (unrolled by Synthesis) - 256-wide 2-bit adders in 12 Bitwise Operations.
Input: x[LANES] of SIMD width
Input: y[LANES] of SIMD width
Input: cin of SIMD width
Output: HOBFLOPS register sum[LANES] of SIMD width
Output: HOBFLOPS register cout of SIMD width
  XOR2X1(x[1], y[1], n_5);   // Wide XOR Operation
  OR2X1(x[1], y[1], n_2);    // Wide OR Operation
  OR2X1(cin, x[0], n_1);
  AND2X1(x[1], y[1], n_0);   // Wide AND Operation
  AND2X1(cin, x[0], n_3);
  AND2X1(y[0], n_1, n_6);
  OR2X1(n_3, n_6, n_8);
  AND2X1(n_2, n_8, n_9);
  OR2X1(n_0, n_9, cout);
  XOR2X1(n_5, n_8, sum[1]);
  XOR2X1(x[0], y[0], n_4);
  XOR2X1(cin, n_4, sum[0]);

ALGORITHM 2: HOBFLOPS Code for a Simple AVX512 2-bit Binary Full Adder - 512-wide 2-bit adders in 4 Bitwise Operations.
Input: x[LANES] of SIMD width
Input: y[LANES] of SIMD width
Input: cin of SIMD width
Output: HOBFLOPS register sum[LANES] of SIMD width
Output: HOBFLOPS register cout of SIMD width
  LUT232X1(cin, y[0], x[0], n_1);    // (B & C) | (A & (B ^ C))
  LUT232X1(x[1], n_1, y[1], cout);
  LUT150X1(y[1], x[1], n_1, sum[1]); // A ^ B ^ C
  LUT150X1(y[0], x[0], cin, sum[0]);

These netlists are re-simulated with the test bench used to simulate the FloPoCo-generated VHDL designs and compared for correlation.

Our domain-specific source-to-source generator translates the Verilog adder and multiplier netlists to Intel AVX2, AVX512, or Arm Neon bitwise operators with the correct types and pointers. Algorithm 1 shows the HOBFLOPS code for a simple 2-bit binary adder with carry, generated from 12 bitwise operations, referencing the cell library of Listing 2. A single input bit of the hardware multiplier becomes the corresponding architecture variable type, e.g., a uint64 type for a 64-bit processor, an __m256i type for an AVX2 processor, an __m512i type for an AVX512 processor, or a uint32x4_t type for a Neon processor. Algorithm 2 demonstrates the efficiency of the AVX512 implementation of the same simple 2-bit adder, generated from 4 bitwise operations.
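As a concrete sketch of the AVX512 cells referenced in Algorithm 2 (the cell names and wrappers are illustrative assumptions; only the _mm512_ternarylogic_epi64 intrinsic and its immediate truth-table encoding belong to the AVX512 instruction set), the two 3-input functions each map to a single instruction:

#include <immintrin.h>

typedef __m512i u512; /* one 512-bit register treated as 512 bitslice lanes */

/* LUT150 (0x96) is the 3-input XOR A ^ B ^ C (the sum function);
   LUT232 (0xE8) is the majority function (B & C) | (A & (B ^ C)) (the carry). */
static inline void LUT150X1(const u512 *A, const u512 *B, const u512 *C, u512 *Y) {
    *Y = _mm512_ternarylogic_epi64(*A, *B, *C, 0x96);
}
static inline void LUT232X1(const u512 *A, const u512 *B, const u512 *C, u512 *Y) {
    *Y = _mm512_ternarylogic_epi64(*A, *B, *C, 0xE8);
}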

When we generate HOBFLOPS adders and multipliers, a HOBFLOPS8 multiplier targeted at the AVX2 processor, for example, is generated in 80 bitwise operations, and a HOBFLOPS8 adder in 319 bitwise operations. When targeted at the AVX512 processor, the HOBFLOPS8 multiplier is generated in 53 bitwise operations, and the HOBFLOPS8 adder in 204 bitwise operations.

3.4 CNN Convolution with HOBFLOPS

We present a method for CNN convolution with HOBFLOPS arithmetic. We implement HOBFLOPS MACs in a CNN convolution layer, where up to 90% of CNN computation time is spent [21]. We compare HOBFLOPS MAC performance to IEEE FP MAC and to Berkeley's SoftFP16 MulAdd function [4].

Figure 6 shows HOBFLOPS input feature map (IFM) and kernel values convolved and stored in the output feature map (OFM). To reduce cache misses, we tile the IFM H × W × C dimensions, which for Conv dw / s2 of the MobileNets CNN is 14 × 14 × 512 = 100,352 elements of HOBFLOPS IFM values. We tile the M kernel values by LANES × NINBITS, where LANES corresponds to the target architecture register bit-width, e.g., 512 lanes corresponds to the AVX512 512-bit wide register.


The LANES × NINBITS tiles of binary values are transformed to NINBITS SIMD-width values, where the SIMD width corresponds to the target architecture register width, e.g., a uint64 type for a 64-bit processor architecture, or an __m512i type for an AVX512 processor.

We broadcast the IFM channel tile of NINBITS across the corresponding channel of all the kernel tiles of NINBITS to convolve image and kernel values using the HOBFLOPS multipliers, adders and rectified linear unit (ReLU) activation function. The resultant convolution SIMD-width values, NOUTBITS wide, are stored in the corresponding location tiles in the OFM. The HOBFLOPS IFM and kernel layout for single precision, as defined by FloPoCo, is:

NINBITS = EXC + SIGN + EXPO_IN + MANT_IN

The HOBFLOPS OFM layout for single precision is:

NOUTBITS = EXC + SIGN + EXPO_IN + MANT_IN + 1

and for extended precision:

NOUTBITS = EXC + SIGN + EXPO_IN + (2 × MANT_IN) + 1

For example, for HOBFLOPS9 (see Table 3), the input layout NINBITS has a 2-bit exception EXC, a 1-bit sign SIGN, a 5-bit exponent EXPO_IN and a 3-bit mantissa MANT_IN, which sum to 11 bits. The single-precision NOUTBITS would be 12 bits (essentially NINBITS + 1). The extended-precision NOUTBITS would be 15 bits.
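A simplified sketch of the resulting inner loop follows; the helper names (hobflops_mul, hobflops_add, hobflops_relu and the slice accessors) are hypothetical placeholders for the generated straight-line bitwise routines and the tiling code, not the paper's actual interfaces.

#include <immintrin.h>

using u512 = __m512i;                 // one SIMD register = 512 bitslice lanes
constexpr int NOUTBITS = 15;          // e.g. HOBFLOPS9 extended-precision output width

// Hypothetical generated routines (placeholders); each operates on arrays of
// SIMD-wide bitslice words and is a straight-line sequence of bitwise operations.
void hobflops_mul(const u512 *a, const u512 *b, u512 *prod);
void hobflops_add(const u512 *a, const u512 *b, u512 *sum);
void hobflops_relu(u512 *v);
const u512 *ifm_slice(int c, int kh, int kw);             // broadcast IFM tile
const u512 *kernel_slice(int m, int c, int kh, int kw);   // kernel weight tile
void store_ofm_slice(int m, const u512 *acc);             // OFM stays in HOBFLOPS layout

void conv_tile(int M_TILE, int C, int K) {
    for (int m = 0; m < M_TILE; m++) {                    // kernels in this tile
        u512 acc[NOUTBITS];                               // accumulator bitslice words
        for (int b = 0; b < NOUTBITS; b++) acc[b] = _mm512_setzero_si512();
        for (int c = 0; c < C; c++)                       // channels
            for (int kh = 0; kh < K; kh++)
                for (int kw = 0; kw < K; kw++) {
                    u512 prod[NOUTBITS];
                    hobflops_mul(ifm_slice(c, kh, kw), kernel_slice(m, c, kh, kw), prod);
                    hobflops_add(acc, prod, acc);         // 512 MACs advance per call
                }
        hobflops_relu(acc);                               // activation in bitslice form
        store_ofm_slice(m, acc);
    }
}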

If HOBFLOPS is implemented in a multi-layer CNN, the data between each layer could remain in HOBFLOPS format until the last convolution layer. The OFM at the last convolutional layer could be transformed from HOBFLOPS values to floats, so the transformation overhead would only occur at the first and last convolutional layers of the CNN model. An additional pooling layer could be developed in the HOBFLOPS format for the interface between the last convolutional layer and the fully connected layer of MobileNets; this is not done in this work.

We repeat the above for MACs up to HOBFLOPS16e on each processor architecture, up to the 512-bit wide AVX512 registers.

4 Evaluation

We implement each of the 8- to 16e-bit HOBFLOPS multipliers and adders in a convolution layer of the MobileNets CNN [22]. We use layer Conv dw / s2 of MobileNets as it has a high number of channels C and kernels M, perfect for demonstrating high-dimensional parallelized MAC operations. We compare the performance of the HOBFLOPS16 multipliers and adders in round-to-nearest-ties-to-even and round-towards-zero modes to Berkeley's SoftFP16 MulAdd in rounding near_even and rounding min modes [4]. This 16-bit FP comparison acts as a baseline, as SoftFP8 is not supported by Berkeley's emulation tool.

We target 32-bit to 512-bit registers for the AVX2 and AVX512 processors and 32- to 128-bit registers for the Cortex-A15 processor. We implement 32 to 512 lanes of HOBFLOPS multipliers and adders and capture each of the AVX2 32-, 64-, 128- and 256-bit, AVX512 32-, 64-, 128-, 256- and 512-bit, and Cortex-A15 32-, 64- and 128-bit results.

Three machine types are used to test the HOBFLOPS MAC:

• Arm Cortex-A15 Neon embedded development kit, containing an ARMv7 rev 3 (v7l) CPU at 2GHz and 2GB RAM;

• Intel Core-i7 PC, containing Intel Core-i7 8700K CPU at 3.7GHz and 32GB RAM;

• Intel Xeon Gold server PC, containing Intel Xeon Gold 5120 at 2.2GHz and 256GB RAM.

Our cell libraries model specific cells, see Table 1. We omit the bit clear (BIC) instruction of the Arm Neon in our cell library. Inclusion of BIC prevents the synthesis tool, Cadence Genus, from optimizing the netlist with SEL (multiplexer) units, leading to a less area-efficient netlist.

To further decrease area and increase performance, we produce round-towards-zero versions of the HOBFLOPS8–HOBFLOPS16e adders, as the rounding can be dealt with at the end of the layer in the activation function, assuming the non-rounded part of the FP value is retained through to the end of the layer.

The MACs per second, averaged over 1000 iterations of the HOBFLOPS adders and multipliers, are captured and compared. We use the GNU G++ compiler to optimize the code for the underlying target microprocessor architecture and number of registers. We compile the HOBFLOPS CNN code (see Figure 6), including our HOBFLOPS adders and multipliers and our cell library (e.g., Listing 2 for AVX2), with G++ (version 8.2.1 20181127 on Arm, version 9.2.0 on Intel AVX2 and version 6.3.0 20170516 on Intel AVX512 machines). We target C++ version 17 and use the -march=native, -mtune=native, -fPIC, -O3 compiler switches, with -msse for SSE devices and -mavx2 for AVX2


Figure 7: Arm Neon HOBFLOPS8-16e MACs Round-to-Nearest-Ties-To-Even Throughput - higher is better.

devices. When targeting an Intel AVX512 architecture we use the -march=skylake-avx512, -mtune=skylake-avx512, -mavx512f, -fPIC, -O3 switches. When targeting an Arm Neon device we use -march=native, -mtune=native, -fPIC, -O3, -mfpu=neon to exploit the use of Neon registers.

After the G++ compilation, we inspect the assembler object dump. Within the multiplier and adder units, we find an almost one-to-one correlation between the logic bitwise operations in the assembler and the gates modeled in the cell libraries, with additional loads/stores where the compiler has seen fit to insert them.

4.1 Arm Cortex-A15

We configure an Arm Cortex-A15 development kit with 2GB RAM, ARCH Linux version 4.14.107-1-ARCH installed, and fix the processor frequency at 2GHz. We run tests for 32, 64 and 128 lanes and capture performance. We use taskset to lock the process to a core of the machine for measurement consistency.

Figure 7 shows 128-lane round-to-nearest-ties-to-even performance for all arbitrary-precision HOBFLOPS FP between 8 and 16e bits, IEEE 8- and 32-bit equivalents and Berkeley's SoftFP versions. HOBFLOPS16 round-to-nearest-ties-to-even achieves approximately half the performance of SoftFP16 MulAdd rounding near_even mode on Arm Neon. However, HOBFLOPS offers arbitrary-precision FP between 8 and 16 bits and can be configured for any bit-width of mantissa and exponent, outperforming SoftFP16 between HOBFLOPS8 and HOBFLOPS11.

Similarly, the HOBFLOPS16 round-towards-zero version shown in Figure 8 demonstrates a slight improvement in performance compared to Berkeley's SoftFP16 MulAdd rounding min mode. Figure 8 also shows the HOBFLOPS round-towards-zero versions have increased performance when compared to the round-to-nearest-ties-to-even versions.

HOBFLOPS appears to exhibit a fluctuation around HOBFLOPS8e and HOBFLOPS9 between Figure 7 and Figure 8. While there is one more bit in the input mantissa of HOBFLOPS9 compared to HOBFLOPS8e, which leads to HOBFLOPS9 containing larger adders/accumulators, Figure 8 shows the round-towards-zero HOBFLOPS9 almost counter-intuitively exhibiting slightly greater throughput than HOBFLOPS8e. The greater throughput of the round-towards-zero HOBFLOPS9 is due to the absence of the rounding adder, its lower gate area and thus reduced latency.


Figure 8: Arm Neon HOBFLOPS8-16e MACs Round-Towards-Zero Throughput - higher is better.

Figure 9: Arm Neon HOBFLOPS8-16e MAC Gate Count: Round To Nearest, Ties To Even - lower is better.


Figure 10: Arm Neon HOBFLOPS8-16e MAC Gate Count: Round Towards Zero - lower is better.

The low bit-widths, and thus the reduced hardware synthesis gate counts or areas seen in Figure 9 and Figure 10, would benefit memory storage and bandwidth within an embedded system, allowing for reduced energy consumption; however, energy consumption is not measured here.

4.2 Intel AVX2

We configure an Intel Core i7-8700K desktop machine with 32GB RAM and ARCH Linux 5.3.4-arch1-1 installed. For consistency of performance measurements of various HOBFLOPS configurations, within the BIOS we disable:

• Intel’s SpeedStep (i.e., prevent the CPU performance from ramping up and down);

• Multi-threading (i.e., do not split the program into separate threads);

• TurboBoost (i.e., keep all processor cores running at the same frequency);

• Hyperthreading Control (i.e., keep one program on one processor core);

• C-States control (i.e., prevent power saving from ramping down the processor core clock frequency).

We alter the GRUB configuration so that intel_pstate (i.e., lock the processor core clock frequency in the Linux kernel) and intel_cstate are disabled on both GRUB_CMDLINE_LINUX and GRUB_CMDLINE_LINUX_DEFAULT. This BIOS and Linux kernel configuration ensures the processor frequency is fixed at 4.6GHz, there is no power saving, and each HOBFLOPS instance runs at full performance on a single thread and a single CPU core. When executing the compiled code, taskset is used to lock the process to a single core of the CPU. These configurations allow a reproducible comparison of the timing performance of each HOBFLOPS configuration against Berkeley's SoftFP16.

We run tests for 32, 64, 128 and 256 lanes and capture performance. Figure 11 shows 256-lane round-to-nearest-ties-to-even results for all arbitrary-precision HOBFLOPS FP between 8 and 16e bits, IEEE 8- and 32-bit equivalents and Berkeley's SoftFP versions. HOBFLOPS16 performs over 2.5× higher MACs/second when compared to Berkeley's SoftFP16 MulAdd rounding near_even mode. The round-towards-zero version of HOBFLOPS16 performs at around 2.7× higher MACs/second when compared to Berkeley's SoftFP16 MulAdd rounding min mode. In fact, HOBFLOPS outperforms SoftFP16 for all versions between HOBFLOPS8 and HOBFLOPS16e for both round-to-nearest-ties-to-even and round-towards-zero rounding modes.


Figure 11: Intel AVX2 HOBFLOPS8-16e MACs Round-to-Nearest-Ties-To-Even Throughput - higher is better.

Figure 12: Intel AVX2 HOBFLOPS8-16e MAC Gate Count: Round To Nearest, Ties To Even - lower is better.


Figure 13: Intel AVX512 HOBFLOPS8-16e MACs Round-to-Nearest-Ties-To-Even Throughput - higher is better.

The HOBFLOPS8 performance gain is due to the reduced synthesis area of the HOBFLOPS units, as seen in Figure 12. Again, this reduction in area, also seen for round-towards-zero, is key to fewer SIMD bitwise operations being created for the HOBFLOPS MACs and therefore reduced latency through the software HOBFLOPS MACs.

4.3 Intel AVX512

We configure an Intel Xeon Gold 5120 server-class machine with 256GB RAM and Debian Linux 4.9.189-3+deb9u2 installed. As this is a shared server-grade machine, we cannot change the BIOS or pin the clock to its maximum frequency as done for the AVX2-based machine. However, taskset is used to lock the process to a single CPU core.

We run tests for 32, 64, 128, 256 and 512 lanes and capture 512-lane performance. Figure 13 shows 512-lane round-to-nearest-ties-to-even results for all arbitrary-precision HOBFLOPS FP between 8 and 16e bits, IEEE 8- and 32-bit equivalents and Berkeley's SoftFP versions. HOBFLOPS16 delivers 8.2× greater MAC throughput than Berkeley's SoftFP16 MulAdd rounding near_even mode. For the 512-lane round-towards-zero results, HOBFLOPS16 performs at 8.4× the MAC throughput of Berkeley's SoftFP16 MulAdd rounding min mode. HOBFLOPS outperforms SoftFP16 for all versions between HOBFLOPS8 and HOBFLOPS16e for both round-to-nearest-ties-to-even and round-towards-zero. HOBFLOPS9 performs at approximately 2 billion MACs/second, around 5× the performance of HOBFLOPS16.

HOBFLOPS performance is due to its lower hardware synthesis area, which, when the netlists are converted to software bitwise operations, translates to fewer SIMD 3-input ternary logic LUT instructions in the MACs. As seen in Figure 14, the HOBFLOPS16 area on the AVX512 platform is 38% smaller than the HOBFLOPS16 area on AVX2. A further slight performance boost is seen for round-towards-zero.

5 Related Work

CNNs require a great deal of computation, so many researchers focus on proposing methods of optimizing FP arithmetic within CNNs, either at the register level or the bit-precision level. Traditionally, researchers propose SIMD register and data path optimizations from within the compiler. An early example from Fisher et al. [23] is the SIMD within a register (SWAR) model, which treats a wide data path within a processor as multiple thin SIMD parallel data paths. This model accelerates arithmetic, allowing wide registers to be partitioned for smaller operations, such as subdividing


Figure 14: Intel AVX512 HOBFLOPS8-16e MAC Gate Count: Round To Nearest, Ties To Even - lower is better.

32-bit integer adders into eight 8-bit adders, assuming the carry from one adder to the next is suppressed. The work of Fisher et al. is an early example of efficient register packing, which inspires the bitslice packing of our work.

Researchers investigate different bit precisions and representations of the FP number base. Google's tensor processing unit (TPU) ASIC [1] implements brain floating point 16-bit (bfloat16) [24], a 16-bit truncated IEEE-754 FP single-precision format. bfloat16 preserves the dynamic range of the 32-bit format due to the 8-bit exponent. The precision is reduced in the mantissa from IEEE's 24 bits down to 7 bits. bfloat16 is not restricted to machine learning; Intel Nervana uses it in their network processing unit (NPU), Intel Xeon processors support AVX512 bfloat16 extensions, Intel Altera's FPGAs use it, as do Arm's ARMv8.6-A architecture and the ROCm libraries. This work only reduces the bit-width of the mantissa and does not investigate different exponent or mantissa bit-widths. Our work addresses these arbitrary exponent and mantissa bit-widths.

Nvidia has proposed the new TensorFloat-32 (TF32) number representation [25], which has a sign bit, an 8-bit exponent and a 10-bit mantissa. This 19-bit floating-point format is an input format that truncates the IEEE FP32 mantissa to 10 bits. TF32 still produces 32-bit floating-point results to move and store in the graphics processing unit (GPU). Similar to Google, Nvidia does not investigate other bit-widths of FP range and precision, something our work does.

Typically, 32-bit or 64-bit precision FP arithmetic is used for inference or training of a CNN. Jeff Johnson at Facebook Research [26] shows the need for lower-power and faster-performing artificial intelligence (AI) designs for mobile devices and data centers, for both inference and training. They suggest improving existing FP implementations to propose a more efficient FP arithmetic which keeps the same precision and dynamic range while reducing power. While our work only investigates performance, Johnson shows that changes in range and precision that lead to increased performance typically also lead to overall reduced energy consumption.

Garland et al. [27] show that they can vary the bit-width of their parallel accumulate shared MAC (PASM) between 4-bit and 32-bit in ASIC and maintain performance and accuracy while reducing the power and area of the multiplier. In their follow-up work, Garland et al. [28] show PASM can be implemented on an FPGA and the bit-width varied between INT8 and 32-bit, saving significant energy with only a slight increase in latency and no change in classification accuracy. Further echoing Johnson's work, Garland et al.'s work demonstrates that the use of varying bit-widths can save energy, suggesting our work would exhibit lower energy, although energy is not measured in our work.


Note there are converter tools, such as Verilator, that convert hardware description language (HDL) code like Verilog to C/C++. Verilator does not merely convert Verilog netlists to C/C++; Verilator's focus is on a fast, optimized, threaded, compiled C++ model of the Verilog for simulation purposes. Verilator does not produce a circuit in a format convertible to bitwise instructions, something our custom code generator does.

6 Conclusion

We propose HOBFLOPS, a method of generating fast custom-precision emulated bitslice parallel software FP arithmetic using a hardware design flow, our cell libraries and a custom code generator. We generate efficient software-emulated FP operators with an arbitrary-precision mantissa and exponent.

HOBFLOPS is advantageous in CNN convolution for increased precision, accuracy and throughput. We also show that HOBFLOPS offers arbitrary-precision FP with custom range and precision values, useful for FP CNN acceleration where memory storage and bandwidth are limited.

We experiment with large numbers of channels and kernels in CNN convolution. When the HOBFLOPS16 MAC is implemented in the convolution layer on Arm Neon and Intel AVX2 and AVX512 processors and compared to Berkeley's SoftFP16 MulAdd FP emulation, HOBFLOPS achieves approximately 0.5×, 2.5× and 8× the performance of SoftFP16 respectively. We show, e.g., that HOBFLOPS9 performs at approximately 2 billion MACs/second on an AVX512 platform, around 5× the performance of HOBFLOPS16, and approximately 45 million MACs/second on an Arm Neon processor, around 6× that of HOBFLOPS16.

The performance gains are due to the optimized hardware synthesis area of the MACs, which translates to fewer bitwise operations. Additionally, the bitslice parallelism of the very wide vectorization of the MACs of CNN convolution contributes to the performance boost. While we show results for 8- and 16-bit with a fixed exponent, HOBFLOPS supports the emulation of any required precision of FP arithmetic at any bit-width of mantissa or exponent, e.g., FP9 containing a 1-bit sign, 5-bit exponent and 3-bit mantissa, or FP11 containing a 1-bit sign, 5-bit exponent and 5-bit mantissa, which are not supported by other software FP emulation methods.

References

[1] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. In-datacenter performance analysis of a tensor processing unit. SIGARCH Comput. Archit. News, 45(2):1–12, June 2017.

[2] E. Chung, J. Fowers, K. Ovtcharov, M. Papamichael, A. Caulfield, T. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman, M. Abeydeera, L. Adams, H. Angepat, C. Boehn, D. Chiou, O. Firestein, A. Forin, K. S. Gatlin, M. Ghandi, S. Heil, K. Holohan, A. El Husseini, T. Juhasz, K. Kagi, R. Kovvuri, S. Lanka, F. van Megen, D. Mukhortov, P. Patel, B. Perez, A. Rapsang, S. Reinhardt, B. Rouhani, A. Sapek, R. Seera, S. Shekar, B. Sridharan, G. Weisz, L. Woods, P. Yi Xiao, D. Zhang, R. Zhao, and D. Burger. Serving DNNs in real time at datacenter scale with Project Brainwave. IEEE Micro, 38(2):8–20, Mar 2018.

[3] Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Logan Adams, and Mahdi Ghandi. A configurable cloud-scale DNN processor for real-time AI. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pages 1–14, Los Angeles, California, USA, 2018. IEEE.

[4] John Hauser. The SoftFloat and TestFloat validation suite for binary floating-point arithmetic. University of California, Berkeley, Tech. Rep., 1999.

[5] S. Xu and D. Gregg. Bitslice vectors: A software approach to customizable data precision on processors with SIMD extensions. In 2017 46th International Conference on Parallel Processing (ICPP), pages 442–451, USA, Aug 2017. IEEE.


[6] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE, 105(12):2295–2329, 2017.

[7] Hyeong-Ju Kang. Short floating-point representation for convolutional neural network inference. IEICE Electronics Express, 15.20180909, 2018.

[8] IEEE. IEEE standard for floating-point arithmetic. IEEE Std 754-2019 (Revision of IEEE 754-2008), pages 1–84, July 2019.

[9] Andrew Anderson, Servesh Muralidharan, and David Gregg. Efficient multibyte floating point data formats using vectorization. IEEE Transactions on Computers, 66(12):2081–2096, 2017.

[10] F. de Dinechin, C. Klein, and B. Pasca. Generating high-performance custom floating-point pipelines. In 2009 International Conference on Field Programmable Logic and Applications, pages 59–64, USA, Aug 2009. IEEE.

[11] Cadence. Cadence Genus synthesis solution, 2016.

[12] C. Wolf and J. Glaser. Yosys - a free Verilog synthesis suite. In Proceedings of the 21st Austrian Workshop on Microelectronics, Austrochip 2013, 2013.

[13] Robert Brayton and Alan Mishchenko. ABC: An academic industrial-strength verification tool. In International Conference on Computer Aided Verification, pages 24–40, Berlin, Heidelberg, 2010. Springer.

[14] Arm. Arm Neon intrinsics reference for ACLE Q2 2019, 2019.

[15] Intel. Intel intrinsics guide, 2020.

[16] Synopsys. Liberty user guides and reference manual suite version 2018.06, October 2018.

[17] Intel. Intel 64 and IA-32 architectures software developer's manual, October 2018.

[18] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.

[19] IEEE. IEEE standard VHDL language reference manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), pages c1–626, Jan 2009.

[20] IEEE. IEEE standard for Verilog hardware description language. IEEE Std 1364-2005 (Revision of IEEE Std 1364-2001), pages 1–590, April 2006.

[21] C. Farabet, B. Martini, P. Akselrod, S. Talay, Y. LeCun, and E. Culurciello. Hardware accelerated convolutional neural networks for synthetic vision systems. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems, pages 257–260, Paris, France, May 2010. IEEE.

[22] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[23] Randall J. Fisher and Henry G. Dietz. Compiling for SIMD within a register. In International Workshop on Languages and Compilers for Parallel Computing, pages 290–305, Berlin, Heidelberg, 1998. Springer.

[24] Google. Bfloat16: The secret to high performance on Cloud TPUs, October 2019.

[25] NVidia. NVIDIA A100 Tensor Core GPU architecture, 2020.

[26] Jeff Johnson. Rethinking floating point for deep learning. arXiv preprint arXiv:1811.01721, 2018.

[27] J. Garland and D. Gregg. Low complexity multiply accumulate unit for weight-sharing convolutional neural networks. IEEE Computer Architecture Letters, 16(2):132–135, July 2017.

[28] James Garland and David Gregg. Low complexity multiply-accumulate units for convolutional neural networks with weight-sharing. ACM Trans. Archit. Code Optim., 15(3):31:1–31:24, September 2018.
