
Laconic Deep Learning Computing

Sayeh Sharify, Mostafa Mahmoud, Alberto Delmas Lascorz, Milos Nikolic, Andreas Moshovos

Abstract

Laconic: Deep Neural Network inference accelerator

• Targets primarily CNNs, can do any layer type
• Exploits effectual bit content for activations and weights
• Term-serial accelerator

20.5× faster and 2.6× more energy efficient than data-parallel accelerators, at a 26% area cost.

Motivation

• Previous accelerators exploit ineffectual computations, including zero skipping, precision variability, and zero bit skipping, to improve the performance and energy efficiency of CNNs.
• We show that policy "At+Wt", eliminating the ineffectual computations at the bit level for both the activations and the weights, has the highest potential speedup.

Conventional bit-parallel accelerator
Execution time does not vary with data bit content.

[Figure: conventional bit-parallel datapath — two multipliers each take a 16-bit activation and a 16-bit weight (bits 0–15), produce 32-bit products, and feed an adder, regardless of the operands' bit content.]
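For contrast with Laconic, here is a minimal behavioral sketch (Python, not part of the original work) of the bit-parallel baseline in the figure: full 16-bit operands are multiplied and accumulated every cycle, so the cost is identical whether the operand bits are mostly ones or mostly zeros.

```python
def bitparallel_mac(acc, activation, weight):
    """One baseline MAC step: a full 16-bit x 16-bit multiply into a 32-bit product, then accumulate.
    The work per cycle is fixed and does not depend on how many bits of the operands are '1'."""
    assert 0 <= activation < (1 << 16) and 0 <= weight < (1 << 16)
    return acc + activation * weight
```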

Laconic’s Approach

Laconic Processing Element

Results Relative to Baseline

Example speedup: $\frac{8 \times 8}{4 \times 3} = 5.3\times$

• Performance gain
  • Convolutional layer: $\frac{P_{base} \times P_{base}}{\langle t_a \rangle \times \langle t_w \rangle}$
  • Fully-connected layer: $\frac{P_{base}}{\langle t_w \rangle}$
• Data access reduction
  • Activations: $\frac{P_{base} - P_a}{P_a}$
  • Weights: $\frac{P_{base} - P_w}{P_w}$ (an illustrative numeric example follows below)
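As a sanity check of these expressions, the sketch below plugs in hypothetical numbers: P_base = 16 for a 16-bit baseline, and example values for ⟨t_a⟩, ⟨t_w⟩, P_a, and P_w that do not come from the poster. It only mirrors the formulas above.

```python
# Hypothetical values: P_base from a 16-bit baseline; the rest are illustrative assumptions.
P_base = 16          # baseline bit-parallel precision (bits)
t_a, t_w = 4.0, 3.0  # assumed mean effectual term counts per activation / weight
P_a, P_w = 9, 8      # assumed effective precisions actually needed

conv_gain = (P_base * P_base) / (t_a * t_w)   # convolutional layers
fc_gain = P_base / t_w                        # fully-connected layers
act_reduction = (P_base - P_a) / P_a          # activation access reduction
wgt_reduction = (P_base - P_w) / P_w          # weight access reduction

print(f"conv gain ~{conv_gain:.1f}x, fc gain ~{fc_gain:.1f}x")
print(f"activation access reduction {act_reduction:.2f}, weight access reduction {wgt_reduction:.2f}")
```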

[Figure: Laconic's approach — worked example with activation 01010110 and weight 00101010: only the effectual (nonzero) bit positions of each operand are paired, versus the baseline, which processes every bit position. The sketch below walks through this example.]
[Figure: Laconic Processing Element — activation and weight terms pass through decoders (Dec.) into a histogram stage whose outputs feed the adder tree.]
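A minimal sketch of the example above (activation 01010110, weight 00101010): each operand is decomposed into its effectual power-of-two terms and only those term pairs are multiplied. The decoder and histogram hardware are not modeled here; this just checks the arithmetic and the 5.3× work reduction.

```python
def effectual_terms(x, bits=8):
    """Positions of the '1' bits of x, i.e., its effectual power-of-two terms."""
    return [i for i in range(bits) if (x >> i) & 1]

a = 0b01010110   # example activation: effectual terms at positions 1, 2, 4, 6
w = 0b00101010   # example weight: effectual terms at positions 1, 3, 5

ta, tw = effectual_terms(a), effectual_terms(w)

# Term-serial product: only the effectual term pairs are processed, each contributing 2^(i+j).
product = sum(1 << (i + j) for i in ta for j in tw)
assert product == a * w     # same result as a full bit-parallel multiply

baseline_work = 8 * 8                 # bit positions an 8x8 bit-parallel multiplier covers
laconic_work = len(ta) * len(tw)      # effectual term pairs: 4 x 3 = 12
print(f"work reduction: {baseline_work / laconic_work:.1f}x")   # ~5.3x
```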

Laconic: Term-Serial Accelerator
Execution time varies with the number of effectual terms. Throughput and energy efficiency relative to the baseline are reported per design configuration in the charts below.

Enhanced Adder Tree

Design Configurations


• More area- and energy-efficient adder tree
• Divides the outputs of the histogram stage into 6 groups, G0, G1, …, G5.
• Outputs within the same group have no overlapping bits that are "1".
• Concatenates the outputs within a group to calculate their sum (a minimal sketch of this grouping idea follows below).
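A minimal sketch of the grouping idea, not the actual gate-level design: values whose '1' bits do not overlap sum to their bitwise OR, so a group can be combined by wiring/concatenation instead of full adders. The greedy grouping function and the sample values below are illustrative assumptions, not the fixed 6-group assignment used in the hardware.

```python
from functools import reduce
import operator

def group_disjoint(values):
    """Greedily place values into groups whose members share no '1' bits (illustrative only;
    the real design uses a fixed assignment of histogram outputs to 6 groups)."""
    groups = []                          # each entry: [occupied bit mask, members]
    for v in values:
        for g in groups:
            if g[0] & v == 0:            # no overlap with bits already used in this group
                g[0] |= v
                g[1].append(v)
                break
        else:
            groups.append([v, [v]])
    return groups

values = [0b100001, 0b000110, 0b011001, 0b000010]    # made-up sample inputs
for mask, members in group_disjoint(values):
    # Disjoint '1' bits mean the group sum equals the bitwise OR: a concatenation, no carries.
    assert sum(members) == reduce(operator.or_, members)
    print(f"group {mask:06b} <- {[f'{m:06b}' for m in members]}")
```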

• Inputs:

• 16 Activations, 1 term/cycle

• 16 Weights, 1 term/cycle

• Multiplication is done term-serially (see the behavioral sketch after this list)

• Reduces the products to a single output

• Supports cascading for smaller layers
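The following is a behavioral sketch (not the RTL) of how a 16-lane term-serial PE could operate under the description above: each cycle every lane contributes at most one activation-term/weight-term pair, the 16 contributions 2^(i+j) are reduced to a single value, and that value is accumulated. The pairing schedule and lane model are simplified assumptions, and cascading is omitted.

```python
from itertools import zip_longest

def terms(x, bits=16):
    """Effectual power-of-two terms of x: the positions of its '1' bits."""
    return [i for i in range(bits) if (x >> i) & 1]

def pe_dot_product(activations, weights):
    """Sketch of a 16-lane term-serial PE: per 'cycle', each lane supplies at most one
    (activation term, weight term) pair, the 16 contributions 2^(i+j) are reduced to a
    single value, and that value is accumulated."""
    assert len(activations) == len(weights) == 16
    # Per-lane streams of term pairs; a real PE pairs terms on the fly via decoders.
    lane_pairs = [[(i, j) for i in terms(a) for j in terms(w)]
                  for a, w in zip(activations, weights)]
    acc, cycles = 0, 0
    for column in zip_longest(*lane_pairs):      # one column = one pair (or idle) per lane
        cycle_sum = 0
        for pair in column:
            if pair is not None:
                i, j = pair
                cycle_sum += 1 << (i + j)        # each effectual term pair contributes 2^(i+j)
        acc += cycle_sum
        cycles += 1
    return acc, cycles

acts = list(range(1, 17))      # illustrative 16 activations
wgts = list(range(17, 33))     # illustrative 16 weights
result, cycles = pe_dot_product(acts, wgts)
assert result == sum(a * w for a, w in zip(acts, wgts))
print(f"dot product = {result}, term-serial 'cycles' = {cycles}")
```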

S: Sparse
PE Array (8×16): 256 input activations, 128 weights, 16 output activations

Datapath Layout

[Charts: throughput (axis 0–35) and energy efficiency (axis 0–6) of Laconic relative to the baseline for four configurations: 256A-128W, 256A-256W, 256A-512W, and 256A-1KW.]

• 65nm TSMC
• Clock frequency: 980 MHz
• 128W Tile:
  • Area: 943,422.48 µm²
  • Power: 224.1 mW
  • Operations per second: 752 GOPS
  • Operations per second per watt: 763.4 GOPS/W

Laconic Architecture

[Figure: an Activation Memory supplying 256 activations and a Weight Memory supplying 128 weights feed the Execution Engine, an 8×16 array of PEs (S PE(0,0) … S PE(7,15)).]

[Chart: Potential Speedup (log scale, 1× to 10,000×) of the A, A+W, At, and At+Wt policies across the studied networks; At+Wt shows the highest potential speedup, reaching up to ~1,544× on one network.]