Transcript

Laconic Deep Learning Computing

Sayeh Sharify, Mostafa Mahmoud, Alberto Delmas Lascorz, Milos Nikolic, Andreas Moshovos

Abstract

Laconic: Deep Neural Network inference accelerator

• Targets primarily CNNs, can do any layer type
• Exploits effectual bit content for activations and weights
• Term-serial accelerator

• 20.5× faster than data-parallel accelerators
• 2.6× more energy efficient
• 26% area cost

Motivation

• Previous accelerators exploit ineffectual computations, including zero skipping, precision variability, and zero-bit skipping, to improve the performance and energy efficiency of CNNs.
• We show that the "At+Wt" policy, which eliminates ineffectual computations at the bit level for both the activations and the weights, has the highest potential speedup.
Conventional Bit-Parallel Accelerator

Execution time does not vary with data bit content.

[Figure: conventional bit-parallel datapath. Activation and weight lanes 0 through 15 feed 16-bit × 16-bit multipliers; the 32-bit products are reduced by an adder.]
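To make the contrast concrete, here is a minimal behavioral sketch of such a bit-parallel lane group (the function name and the Python modeling are illustrative, not the actual datapath): the loop always performs sixteen full-width multiplies, so its running time never depends on the operands' bit content.

```python
# Minimal model of a conventional bit-parallel MAC group:
# sixteen 16-bit activations x sixteen 16-bit weights -> 32-bit products -> one sum.
def bit_parallel_mac(activations, weights):
    assert len(activations) == len(weights) == 16
    acc = 0
    for a, w in zip(activations, weights):
        # Full-precision multiply every time, even if most bits are zero.
        acc += (a & 0xFFFF) * (w & 0xFFFF)
    return acc

# Sparse or dense bit content makes no difference to the work performed:
print(bit_parallel_mac([1] * 16, [1] * 16))            # 16 multiplies
print(bit_parallel_mac([0xFFFF] * 16, [0xFFFF] * 16))  # still 16 multiplies
```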

Laconic’s Approach

Laconic Processing Element

Results Relative to Baseline

Example: a bit-parallel 8×8-bit multiplication performs 8 × 8 = 64 single-bit products; with only 4 effectual activation terms and 3 effectual weight terms, 4 × 3 = 12 are needed, a 64 / 12 ≈ 5.3× reduction in work.

• Performance gain (sketched in code below)
  • Convolutional layer: (P_base × P_base) / (⟨t_a⟩ × ⟨t_w⟩)
  • Fully-connected layer: P_base / ⟨t_w⟩
• Data access reduction
  • Activations: (P_base − P_a) / P_a
  • Weights: (P_base − P_w) / P_w
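A small sketch of these expressions, assuming P_base is the baseline operand precision in bits, ⟨t_a⟩ and ⟨t_w⟩ are the average effectual term counts, and P_a, P_w are the reduced activation and weight precisions; the function names are made up for illustration.

```python
# Illustrative calculator for the potential-gain formulas above
# (hypothetical names; notation follows the poster).
def conv_layer_gain(P_base, t_a, t_w):
    # Convolutional layer: (P_base x P_base) / (<t_a> x <t_w>)
    return (P_base * P_base) / (t_a * t_w)

def fc_layer_gain(P_base, t_w):
    # Fully-connected layer: P_base / <t_w>
    return P_base / t_w

def access_reduction(P_base, P):
    # Data access reduction: (P_base - P) / P, for activations or weights.
    return (P_base - P) / P

# Example with a 16-bit baseline and ~4 effectual terms on average:
print(conv_layer_gain(16, 4.0, 4.0))  # 16.0
print(fc_layer_gain(16, 4.0))         # 4.0
print(access_reduction(16, 8))        # 1.0
```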

[Figure: Laconic's approach, contrasted with the baseline. An example 16-bit value (0101011000101010) is decomposed into its effectual power-of-two terms; decoders ("Dec.") expand the activation and weight term exponents, and a histogram-plus-adder stage accumulates the exponent sums into the product.]
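A minimal sketch of this term-based view, assuming unsigned operands and illustrative helper names: each value is decomposed into the exponents of its "1" bits, every activation-term/weight-term pair contributes only an exponent sum, and a histogram of those sums recovers the exact product.

```python
# Sketch of Laconic's term-based product (unsigned values; helper names are
# illustrative, not the hardware interface).
def effectual_terms(x):
    """Exponents of the '1' bits, e.g. 135 = 0b10000111 -> [0, 1, 2, 7]."""
    return [i for i in range(x.bit_length()) if (x >> i) & 1]

def term_product(a, w):
    """a * w via a histogram of exponent sums, mirroring the Dec./histogram stages."""
    histogram = {}
    for ea in effectual_terms(a):
        for ew in effectual_terms(w):
            histogram[ea + ew] = histogram.get(ea + ew, 0) + 1
    # Each bucket contributes count << exponent; only effectual term pairs did work.
    return sum(count << exp for exp, count in histogram.items())

assert term_product(135, 35) == 135 * 35  # 4 x 3 = 12 term pairs instead of 8 x 8 = 64
```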

Laconic: Term-Serial Accelerator

Execution time varies with the number of effectual terms.

Throughput and energy efficiency: see the charts under Results Relative to Baseline.

Enhanced Adder Tree

Design Configurations


• More area- and energy-efficient adder tree
• Divides the output of the histogram stage into 6 groups, G0, G1, …, G5
• Outputs within the same group have no overlapping "1" bits
• Concatenates the outputs within a group to calculate their sum (sketched below)
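A sketch of the grouping idea, under the assumption that with 16 term pairs per PE each histogram count fits in 5 bits, so buckets taken every sixth exponent never overlap once shifted; the mod-6 assignment to G0..G5 is an assumption about how the poster's six groups are formed.

```python
# Sketch of the enhanced adder tree's concatenation trick.
# Assumption: counts[e] <= 16 (5 bits), so shifted counts spaced 6 positions
# apart occupy disjoint bit ranges and can be "summed" by bitwise OR.
NUM_GROUPS = 6  # G0..G5 on the poster

def reduce_histogram(counts):
    """counts[e] = number of term pairs whose exponents summed to e."""
    group_values = []
    for g in range(NUM_GROUPS):
        concat = 0
        for e in range(g, len(counts), NUM_GROUPS):
            concat |= counts[e] << e  # concatenation, no carries between buckets
        group_values.append(concat)
    # Only the six group values need a real adder stage.
    return sum(group_values)

# Sanity check against a plain weighted sum:
counts = [3, 0, 16, 1, 0, 7, 2, 0, 0, 5, 1, 0, 0, 0, 4, 0]
assert reduce_histogram(counts) == sum(c << e for e, c in enumerate(counts))
```

Because the concatenation replaces most of the additions, only the six group values go through a conventional adder, which is where the claimed area and energy savings would come from.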

• Inputs:

• 16 Activations, 1 term/cycle

• 16 Weights, 1 term/cycle

• Multiplication is done term-serially (see the behavioral sketch after this list)

• Reduces the products to a single output

• Supports cascading for smaller layers
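A behavioral sketch of one PE under an assumed schedule (each of the 16 lanes consumes one activation-term/weight-term pair per cycle, and lanes with fewer effectual terms finish early); it is meant only to show how execution time tracks effectual term pairs rather than bit width.

```python
# Behavioral PE sketch (the scheduling is an assumption, not the documented pipeline).
def one_bits(x):
    return [i for i in range(x.bit_length()) if (x >> i) & 1]

def pe_model(activations, weights):
    """16 activation/weight pairs -> (dot product, modeled cycle count)."""
    assert len(activations) == len(weights) == 16
    acc, cycles = 0, 0
    for a, w in zip(activations, weights):
        pairs = [(ea, ew) for ea in one_bits(a) for ew in one_bits(w)]
        cycles = max(cycles, len(pairs))  # slowest lane sets the pace
        acc += sum(1 << (ea + ew) for ea, ew in pairs)
    return acc, cycles

# Sparse bit content finishes in 4 modeled cycles here; dense 16-bit operands
# could require up to 16 x 16 = 256 term pairs per lane.
print(pe_model([0b100000001] * 16, [0b11] * 16))  # (12336, 4)
```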

S: Sparse

PE Array (8 × 16): 256 input activations, 128 weights, 16 output activations

Datapath Layout

[Charts: Laconic throughput and energy efficiency relative to the baseline for the 256A-128W, 256A-256W, 256A-512W, and 256A-1KW configurations.]

• 65nm TSMC
• Clock frequency: 980 MHz
• 128W tile:
  • Area: 943,422.48 µm²
  • Power: 224.1 mW
  • Operations per second: 752 GOPS
  • Operations per second, per watt: 763.4 GOPS/W

Laconic Architecture

[Figure: datapath layout. An 8×16 array of sparse processing elements, SPE(0,0) through SPE(7,15), forms the execution engine; the Activation Memory delivers 256 activations and the Weight Memory delivers 128 weights per access.]


[Chart: potential speedup (log scale, 1× to 10,000×) of the A, A+W, At, and At+Wt policies; At+Wt shows the highest potential speedup.]
