TANGRAM: Optimized Coarse-Grained Dataflow for Scalable NN Accelerators
Mingyu Gao, Xuan Yang, Jing Pu, Mark Horowitz, Christos Kozyrakis
Stanford University and Tsinghua University
ASPLOS – April 2019
Neural Networks (NNs)
- Unprecedented accuracy for challenging applications
  - Fully-connected (MLPs), convolutional (CNNs), and recurrent (LSTMs) NNs
- Inference: layer-wise processing on directed acyclic graphs (DAGs)
[Figure: example NN structures - a convolutional NN (Conv and FC layers), an LSTM cell (I/F/O gates), and an Inception module (parallel 1x1, 3x3, 5x5 Conv and 3x3 Pool branches)]
NN Accelerators
- Domain-specific processing engines
  - An array of specialized processing elements (PEs)
  - On-chip register files and SRAMs
  - ~100x better performance and energy efficiency
- DianNao/Cambricon, Google TPU, Eyeriss, Cnvlutin, EIE, ...
[Figure: a processing element (ALU + register file) and an NN processing engine (a PE array attached to a global buffer)]
Scaling NN Performance
- Use more PEs and more on-chip buffers
- Monolithic engine:
  ✗ Low resource utilization
  ✗ Long array buses
  ✗ Far from SRAM
- Tiled architecture (the focus of our work):
  ✓ Mostly local data transfers
  ✓ Easy to scale up/down
  ? Dataflow scheduling
[Figure: a monolithic PE array with a single global buffer vs. a tiled architecture of GBuf+array tiles connected to memory channels Mem 0-3]
TANGRAM: Optimizing Coarse-Grained Dataflow
- Intra-layer parallelism
  - Buffer sharing dataflow
    - Reuse data across engines → higher energy efficiency
    - Avoid on-chip data duplication → smaller buffer area
- Inter-layer pipelining
  - Fine-grained data forwarding and pipelining of complex DAGs
    - Reduce pipeline stalls → higher throughput
    - Temporarily store forwarded data → smaller buffer area
Intra-Layer Parallelism
Parallelizing a Single Layer
- Inefficient buffer use for shared data:
  ✗ Replicated buffered data (area)
  ✗ Data reuse limited to within each tile (energy)
- ALL parallelization schemes share some data!
[Figure: ifmaps (Nb x Ni), weights (No x Ni), and ofmaps (Nb x No); Ofmaps = Ifmaps * Weights]

foreach b in batch Nb
  foreach ifmap i in Ni
    foreach ofmap o in No
      // 2D conv
      O[b][o] += I[b][i] * W[o][i]

[Figure: four tiles computing O[0][0], O[0][1], O[1][0], O[1][1]; each tile buffers its own copy of the shared ifmaps I[0][0:1], I[1][0:1] and weights W[0][0:1], W[1][0:1]]
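The loop nest above can be written out as runnable Python (a minimal NumPy sketch with made-up sizes; the sliding-window inner loops stand in for the 2D convolution):

```python
import numpy as np

# Illustrative sizes (hypothetical, not from the talk): batch Nb, input fmaps Ni,
# output fmaps No, ofmap H x W, filter R x R.
Nb, Ni, No, H, W, R = 2, 2, 2, 4, 4, 3
I = np.random.rand(Nb, Ni, H + R - 1, W + R - 1)   # pre-padded ifmaps
Wt = np.random.rand(No, Ni, R, R)                  # weights
O = np.zeros((Nb, No, H, W))                       # ofmaps

for b in range(Nb):
    for i in range(Ni):
        for o in range(No):
            # 2D conv: O[b][o] += I[b][i] * W[o][i]
            for y in range(H):
                for x in range(W):
                    O[b, o, y, x] += np.sum(I[b, i, y:y+R, x:x+R] * Wt[o, i])
```

Whichever loop is parallelized across tiles, some operand stays shared: splitting by b shares all weights, splitting by o shares all ifmaps, which is the replication problem the slide points out.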
Optimizing Dataflow for Shared Data
- Skew the computation order of the engines
  - All engines start in parallel → high throughput
- Rotate buffered data between engines
  - Fully reuse shared data → low energy
  - No on-chip data duplication → low area
[Figure: the same four tiles computing O[0][0], O[0][1], O[1][0], O[1][1], with the ifmaps I[b][i] and weights W[o][i] rotated between tiles instead of replicated]
Buffer Sharing Dataflow
- Unify the distributed buffers into one ideally large shared buffer
  - Efficiently store and reuse data
- Formalized as loop transformations
  - (tile coordinate x, time step t) → index i of the data to be buffered
  - See the paper for the detailed math
- Easy to implement
  - The buffer controller fetches from memory or from other tiles
  - No changes to the dataflow within a tile
- Supports all parallelization schemes (including hybrid)
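The (x, t) → i mapping can be illustrated with a toy rotation schedule (a hypothetical sketch, not the paper's exact transformation):

```python
# At time step t, tile x works on shared-data chunk i = (x + t) mod T.
# Every tile starts immediately (skewed computation order), and each chunk is
# fetched from DRAM once, then rotated tile-to-tile instead of being replicated.
def chunk_for_tile(x, t, num_tiles):
    return (x + t) % num_tiles

T = 4  # illustrative tile count
schedule = [[chunk_for_tile(x, t, T) for x in range(T)] for t in range(T)]
```

Over T steps, each chunk visits every tile exactly once, so the distributed per-tile buffers collectively behave like one large shared buffer.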
Inter-Layer Pipelining
Pipelining Multiple Layers
- Pros: avoid off-chip accesses for intermediate data
  - Save DRAM bandwidth and energy
- Cons: use resources less efficiently
  - Long delays: pipeline filling/draining due to inter-layer data dependencies
  - Large SRAM buffers: must store the entire intermediate data
[Figure: layers 1-4 pipelined across the tiles of the chip]
Fine-Grained Data Forwarding
- Forward each subset of data to the next layer as soon as it is ready
  - Reduce pipeline stalls: the next layer starts earlier
  - Reduce buffer capacity: only store the subset currently being forwarded
- Requires matched access patterns between adjacent layers
foreach b in batch Nb
  foreach ifmap i in Ni
    foreach ofmap o in No
      // 2D conv
      O[b][o] += I[b][i] * W[o][i]

The batch loop has no cross-iteration dependencies and is trivially pipelined.

foreach ofmap o in No
  foreach ifmap i in Ni
    // 2D conv

foreach ifmap i in Ni
  foreach ofmap o in No
    // 2D conv

[Figure: timelines for Layers 1-3 showing when fmap subsets 0, 1, 2, ... are produced and consumed under the two loop orders]
Alternate Layer Loop Ordering (ALLO)

[Figure: three-layer pipeline timelines. Unoptimized: each layer must buffer and wait for ALL fmaps of its producer. Optimized: alternating the loop order between adjacent layers lets each layer buffer and wait for only ONE fmap.]

Benefits apply to half of all layers.
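A toy timing model shows why forwarding one fmap at a time shrinks the pipeline-fill delay (assumptions: one time unit per fmap per layer, one engine per layer; the numbers are illustrative, not from the talk):

```python
# A layer may start once `granularity` fmaps of its producer are finished:
# a whole layer's worth without forwarding, a single fmap with ALLO-matched
# fine-grained forwarding.
def finish_time(n_layers, n_fmaps, granularity):
    start = 0
    finish = 0
    for _ in range(n_layers):
        finish = start + n_fmaps   # this layer processes all its fmaps
        start += granularity       # next layer starts after `granularity` fmaps
    return finish

coarse = finish_time(3, 8, 8)  # wait for ALL 8 fmaps: finishes at t = 24
fine = finish_time(3, 8, 1)    # forward ONE fmap at a time: finishes at t = 10
```

With fine-grained forwarding the three layers overlap almost completely, and only the subset in flight needs buffering.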
Layer Pipelining for Complex NN DAGs
- A dataflow tool explores pipeline schedules across multiple layers
- Subject to design rules imposed by data dependency constraints
  - E.g., no multiple predecessor layers on-chip
[Figure: pipeline schedules with on-chip buffer regions R0-R6 for an Inception module, a ResNet module, and an LSTM cell]
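One plausible reading of the "no multiple predecessor layers on-chip" rule can be sketched as a checker (the exact formulation is in the paper; `segment_ok` and the diamond DAG below are illustrative assumptions):

```python
# Check a candidate group of layers to pipeline on-chip together: reject it if
# any layer in the group has more than one of its predecessors also in the group.
def segment_ok(segment, preds):
    on_chip = set(segment)
    for layer in segment:
        on_chip_preds = [p for p in preds.get(layer, []) if p in on_chip]
        if len(on_chip_preds) > 1:
            return False
    return True

# Diamond dependency, as in a residual block: 'add' consumes 'conv' and 'skip'.
preds = {'conv': ['in'], 'skip': ['in'], 'add': ['conv', 'skip']}
```

Here `segment_ok(['conv', 'skip', 'add'], preds)` fails, while a segment containing only one branch of the diamond passes; a scheduling tool would enumerate only the valid segments.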
Evaluation Results
Modeling Methodology
- State-of-the-art NNs
  - CNNs: AlexNet, VGGNet, GoogLeNet, ResNet
  - MLPs and LSTMs: medium and large scales
- Hardware
  - Inference engine: Eyeriss [ISCA'16], 8 × 8 PEs, 32 kB buffer, 500 MHz
  - Off-chip memory: LPDDR3-1600, 4 channels
  - Overall chip: 16 × 16 tiles
    - 16384 PEs + 8 MB SRAM
    - 90 mm² at 28 nm
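The configuration numbers are mutually consistent, which a quick check confirms (16 × 16 tiles of 8 × 8 PEs each, and 8 MB of SRAM split evenly across the tiles):

```python
# Sanity-check the chip configuration arithmetic from this slide.
tiles = 16 * 16             # 256 tiles
pes_per_tile = 8 * 8        # Eyeriss-style 8 x 8 PE array per tile
total_pes = tiles * pes_per_tile
kb_per_tile = 8 * 1024 // tiles  # 8 MB of SRAM spread over the tiles
```

This yields 16384 PEs in total and 32 kB of buffer per tile, matching the per-engine figure above.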
Overall Comparison
- Base tiled vs. monolithic: 3.6x performance, but 7% worse energy
  - Less flexible and less efficient use of on-chip SRAM buffers
- TANGRAM: 2x over base tiled, and outperforms monolithic
[Charts: normalized time and normalized energy of Monolithic, Base Tiled, and TANGRAM on AlexNet, VGGNet, GoogLeNet, ResNet, MLP-M, MLP-L, LSTM-M, and LSTM-L]
Intra- vs. Inter-Layer Optimizations
- Intra-layer: buffer sharing
  - AlexNet: fits large fmaps on-chip
  - MLP-L: enables weight pinning
- Inter-layer: ALLO + complex DAG pipelining
  - Helps AlexNet, GoogLeNet, and LSTM-M
  - Linear NNs benefit less
[Chart: normalized energy of TANGRAM, TANGRAM w/o intra-layer, and TANGRAM w/o inter-layer optimizations on AlexNet, GoogLeNet, MLP-L, and LSTM-M; the w/o-intra bar for MLP-L reaches 4.59]
Summary
- Efficiently scale NN acceleration
  - Coarse-grained parallel dataflow on tiled architectures
  - Optimized tiled architectures outperform monolithic engines
- TANGRAM: dataflow optimizations
  - Intra-layer buffer sharing
  - Inter-layer pipelining with fine-grained data forwarding
  - Pipelining of complex NN DAGs
- Dataflow scheduling tool open-sourced
  - https://github.com/stanford-mast/nn_dataflow
Thank you!