Laconic Deep Learning Computing
Sayeh Sharify, Mostafa Mahmoud, Alberto Delmas Lascorz, Milos Nikolic, Andreas Moshovos
Abstract
Laconic: a Deep Neural Network inference accelerator
• Targets primarily CNNs; can execute any layer type
• Exploits the effectual bit content of both activations and weights
• Term-serial accelerator
• 20.5× faster and 2.6× more energy efficient than data-parallel accelerators, at a 26% area cost
Motivation
• Previous accelerators exploit ineffectual computations, including zero skipping, precision variability, and zero-bit skipping, to improve the performance and energy efficiency of CNNs.
• We show that policy "At+Wt", which eliminates ineffectual computations at the bit level for both the activations and the weights, has the highest potential speedup.
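The sketch below illustrates these policies on a single activation/weight pair. It is a minimal model that treats every nonzero bit as an effectual term and a full 16×16 grid of single-bit products as the baseline work; the function names and accounting are illustrative, not the paper's exact definitions.

```python
# Minimal sketch: estimate the work left by bit-level ineffectual-computation
# policies for one activation/weight pair of 16-bit fixed-point values.
# Assumption: every '1' bit is an effectual term (a simplification of
# Laconic's term encoding).

P_BASE = 16  # baseline bit-parallel precision

def effectual_terms(x: int) -> int:
    """Number of nonzero bits in a P_BASE-bit operand."""
    return bin(x & ((1 << P_BASE) - 1)).count("1")

def work(a: int, w: int, policy: str) -> int:
    """Single-bit products a design must still process for a x w."""
    ta, tw = effectual_terms(a), effectual_terms(w)
    if policy == "A":        # skip zero activations only
        return 0 if a == 0 else P_BASE * P_BASE
    if policy == "A+W":      # skip zero activations or zero weights
        return 0 if a == 0 or w == 0 else P_BASE * P_BASE
    if policy == "At":       # skip ineffectual activation bits
        return ta * P_BASE
    if policy == "At+Wt":    # skip ineffectual bits on both sides
        return ta * tw
    raise ValueError(policy)

a, w = 0b01010110, 0b00101010          # example operands: 4 and 3 effectual bits
for p in ("A", "A+W", "At", "At+Wt"):
    print(p, work(a, w, p))            # At+Wt leaves the least work
```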
Conventional bit-parallel accelerator
Execution time does not vary with data bit content.
[Figure: baseline datapath with 16-bit × 16-bit bit-parallel multipliers for activations and weights; the 32-bit products are summed by an adder regardless of the operands' bit content.]
Laconic's Approach
Example: for 8-bit operands the baseline processes 8×8 = 64 single-bit products per multiplication; with 4 effectual activation terms and 3 effectual weight terms, Laconic processes only 4×3 = 12, an 8×8 / (4×3) ≈ 5.3× reduction in work.
• Performance gain
  • Convolutional layer: (P_base × P_base) / (⟨t_a⟩ × ⟨t_w⟩)
  • Fully-connected layer: P_base / ⟨t_w⟩
• Data access reduction
  • Activations: (P_base − P_a) / P_a
  • Weights: (P_base − P_w) / P_w
Here P_base is the baseline precision (16 bits), ⟨t_a⟩ and ⟨t_w⟩ are the average numbers of effectual activation and weight terms, and P_a and P_w are the precisions the activations and weights actually need.
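A minimal numeric illustration of these expressions, with hypothetical averages plugged in (the actual ⟨t_a⟩, ⟨t_w⟩, P_a, and P_w are profiled per network):

```python
# Minimal sketch of the potential-gain expressions above, with hypothetical
# per-network statistics plugged in (the real values are profiled per layer).

P_BASE = 16          # baseline bit-parallel precision

def conv_gain(t_a: float, t_w: float) -> float:
    """Performance gain for a convolutional layer."""
    return (P_BASE * P_BASE) / (t_a * t_w)

def fc_gain(t_w: float) -> float:
    """Performance gain for a fully-connected layer."""
    return P_BASE / t_w

def access_reduction(p: float) -> float:
    """Data access reduction for operands needing only p bits."""
    return (P_BASE - p) / p

# Hypothetical averages: 4 effectual activation terms, 3 effectual weight
# terms, 9-bit activations, 8-bit weights.
print(conv_gain(4, 3))          # ~21.3
print(fc_gain(3))               # ~5.3
print(access_reduction(9))      # ~0.78
print(access_reduction(8))      # 1.0
```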
[Figure: Laconic's approach vs. the baseline on example 8-bit operands (activation 01010110, weight 00101010) - decoders extract each operand's effectual term exponents, the exponent pairs are summed, and a histogram stage followed by an adder reduces the term products, whereas the baseline processes all bit positions regardless of content.]
Laconic: term-serial accelerator
Execution time varies with the number of effectual terms.
Throughput and energy efficiency are reported relative to the baseline under "Results Relative to Baseline".
Enhanced Adder Tree
• More area- and energy-efficient adder tree
• Divides the outputs of the histogram stage into 6 groups, G0, G1, …, G5
• Outputs within the same group have no overlapping bits that are "1"
• Concatenates the outputs within a group to calculate their sum (see the sketch below)
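A minimal sketch of this grouping idea, assuming 6-bit histogram counts so that outputs whose exponents differ by a multiple of 6 occupy disjoint bit fields; the widths and grouping rule here are illustrative, not the exact hardware parameters:

```python
# Minimal sketch of the enhanced adder tree idea, assuming 6-bit histogram
# counts: outputs whose exponents differ by a multiple of 6 occupy disjoint
# bit fields, so their "sum" within a group is just a concatenation (OR).
# Group count and field widths are illustrative, not the hardware parameters.

COUNT_BITS = 6                      # assumed width of one histogram count

def group_sums(hist: dict[int, int]) -> list[int]:
    """hist maps term exponent -> count; returns the 6 group sums G0..G5."""
    groups = [0] * 6
    for exp, count in hist.items():
        assert 0 <= count < (1 << COUNT_BITS)
        g = exp % 6                 # exponents e, e+6, e+12, ... share a group
        # Shifting by exp places this count in a bit field that no other
        # member of the same group touches, so OR equals addition here.
        groups[g] |= count << exp
    return groups

def total(hist: dict[int, int]) -> int:
    # Only the 6 group sums need a real adder tree.
    return sum(group_sums(hist))

# Example: 3 term products of magnitude 2^0, 5 of 2^2, 2 of 2^6.
hist = {0: 3, 2: 5, 6: 2}
assert total(hist) == 3 * 1 + 5 * 4 + 2 * 64
print(group_sums(hist), total(hist))
```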
Laconic Processing Element
• Inputs:
  • 16 activations, 1 term/cycle
  • 16 weights, 1 term/cycle
• Multiplication is done term-serially (see the sketch after this list)
• Reduces the products to a single output
• Supports cascading for smaller layers
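A functional sketch of term-serial multiplication as described above, using one power-of-two term per set bit (a simplification of Laconic's term encoding) and modeling only the arithmetic, not the cycle-by-cycle hardware:

```python
# Functional sketch of a term-serial multiply: operands are decomposed into
# effectual terms (here simply one power of two per set bit), and the product
# is accumulated one term pair per step by adding the term exponents.

def terms(x: int) -> list[int]:
    """Exponents of the effectual terms of a non-negative operand."""
    return [i for i in range(x.bit_length()) if (x >> i) & 1]

def term_serial_mul(a: int, w: int) -> int:
    acc = 0
    for ea in terms(a):            # one activation term per step
        for ew in terms(w):        # one weight term per step
            acc += 1 << (ea + ew)  # product of two terms = shift by exponent sum
    return acc

a, w = 0b01010110, 0b00101010      # the example operands: 4 and 3 terms
assert term_serial_mul(a, w) == a * w
print(len(terms(a)) * len(terms(w)), "term pairs vs", 8 * 8, "bit products")
```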
S: Sparse
PE array (8×16): 256 input activations, 128 weights, 16 output activations
Datapath Layout
Design configurations evaluated: 256A-128W, 256A-256W, 256A-512W, and 256A-1KW.
• 65 nm TSMC
• Clock frequency: 980 MHz
• 128W tile:
  • Area: 943,422.48 µm²
  • Power: 224.1 mW
  • Operations per second: 752 GOPS
  • Operations per second per watt: 763.4 GOPS/W
Laconic Architecture
[Figure: execution engine with an 8×16 array of sparse processing elements, SPE(0,0) through SPE(7,15), fed 256 activations from the activation memory and 128 weights from the weight memory.]
Results Relative to Baseline
[Bar charts: speedup and relative energy efficiency of the Laconic configurations (256A-128W through 256A-1KW) over the baseline.]
[Bar chart: potential speedup of policies A, A+W, At, and At+Wt (logarithmic scale); At+Wt shows the highest potential.]