TANGRAM: Optimized Coarse-Grained Dataflow for Scalable NN Accelerators
Mingyu Gao, Xuan Yang, Jing Pu, Mark Horowitz, Christos Kozyrakis
Stanford University and Tsinghua University
ASPLOS – April 2019
Neural Networks (NNs)
- Unprecedented accuracy for challenging applications
  - Fully-connected (MLPs), convolutional (CNNs), and recurrent (LSTMs) NNs
- Inference: layer-wise processing on directed acyclic graphs (DAGs)
[Figure: example NN structures - a convolutional NN (Conv and FC layers), an LSTM cell (I/F/O gates), and an Inception module (parallel 1x1, 3x3, 5x5 Conv and 3x3 Pool branches)]
NN Accelerators
- Domain-specific processing engines
  - An array of specialized processing elements (PEs)
  - On-chip register files and SRAMs
  - ~100x better performance and energy efficiency
- DianNao/Cambricon, Google TPU, Eyeriss, Cnvlutin, EIE, ...
[Figure: a processing element (ALU + register file) and an NN processing engine (a PE array attached to a global buffer)]
Scaling NN Performance
- Use more PEs and more on-chip buffers
- Monolithic engine:
  ✗ Low resource utilization
  ✗ Long array buses
  ✗ Far from SRAM
- Tiled architecture (the focus of our work):
  ✓ Mostly local data transfers
  ✓ Easy to scale up/down
  ? Dataflow scheduling
[Figure: a monolithic PE array with a single global buffer vs. a tiled architecture of GBuf+array tiles connected to memory channels Mem 0-3]
TANGRAM: Optimizing Coarse-Grained Dataflow
- Intra-layer parallelism
  - Buffer sharing dataflow
    - Reuse data across engines → higher energy efficiency
    - Avoid on-chip data duplication → smaller buffer area
- Inter-layer pipelining
  - Fine-grained data forwarding and pipelining of complex DAGs
    - Reduce pipeline stalls → higher throughput
    - Temporarily store forwarded data → smaller buffer area
Intra-Layer Parallelism
Parallelizing a Single Layer
- Inefficient buffer use for shared data:
  ✗ Replicated buffered data (area)
  ✗ Data reuse limited to within each tile (energy)
- ALL parallelization schemes share some data!
[Figure: ifmaps (Nb x Ni), weights (No x Ni), and ofmaps (Nb x No); Ofmaps = Ifmaps * Weights]

foreach b in batch Nb
  foreach ifmap i in Ni
    foreach ofmap o in No
      // 2D conv
      O[b][o] += I[b][i] * W[o][i]

[Figure: four tiles computing O[0][0], O[0][1], O[1][0], O[1][1]; each tile buffers its own copy of the shared ifmaps I[0][0:1], I[1][0:1] and weights W[0][0:1], W[1][0:1]]
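The loop nest above can be written out as runnable Python (a minimal NumPy sketch with made-up sizes; the sliding-window inner loops stand in for the 2D convolution):

```python
import numpy as np

# Illustrative sizes (hypothetical, not from the talk): batch Nb, input fmaps Ni,
# output fmaps No, ofmap H x W, filter R x R.
Nb, Ni, No, H, W, R = 2, 2, 2, 4, 4, 3
I = np.random.rand(Nb, Ni, H + R - 1, W + R - 1)   # pre-padded ifmaps
Wt = np.random.rand(No, Ni, R, R)                  # weights
O = np.zeros((Nb, No, H, W))                       # ofmaps

for b in range(Nb):
    for i in range(Ni):
        for o in range(No):
            # 2D conv: O[b][o] += I[b][i] * W[o][i]
            for y in range(H):
                for x in range(W):
                    O[b, o, y, x] += np.sum(I[b, i, y:y+R, x:x+R] * Wt[o, i])
```

Whichever loop is parallelized across tiles, some operand stays shared: splitting by b shares all weights, splitting by o shares all ifmaps, which is the replication problem the slide points out.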
Optimizing Dataflow for Shared Data
- Skew the computation order of the engines
  - All engines start in parallel → high throughput
- Rotate buffered data between engines
  - Fully reuse shared data → low energy
  - No on-chip data duplication → low area
[Figure: the same four tiles computing O[0][0], O[0][1], O[1][0], O[1][1], with the ifmaps I[b][i] and weights W[o][i] rotated between tiles instead of replicated]
Buffer Sharing Dataflow
- Unify the distributed buffers into one ideally large shared buffer
  - Efficiently store and reuse data
- Formalized as loop transformations
  - (tile coordinate x, time step t) → index i of the data to be buffered
  - See the paper for the detailed math
- Easy to implement
  - The buffer controller fetches from memory or from other tiles
  - No changes to the dataflow within a tile
- Supports all parallelization schemes (including hybrid)
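The (x, t) → i mapping can be illustrated with a toy rotation schedule (a hypothetical sketch, not the paper's exact transformation):

```python
# At time step t, tile x works on shared-data chunk i = (x + t) mod T.
# Every tile starts immediately (skewed computation order), and each chunk is
# fetched from DRAM once, then rotated tile-to-tile instead of being replicated.
def chunk_for_tile(x, t, num_tiles):
    return (x + t) % num_tiles

T = 4  # illustrative tile count
schedule = [[chunk_for_tile(x, t, T) for x in range(T)] for t in range(T)]
```

Over T steps, each chunk visits every tile exactly once, so the distributed per-tile buffers collectively behave like one large shared buffer.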
Inter-Layer Pipelining
Pipelining Multiple Layers
- Pros: avoid off-chip accesses for intermediate data
  - Save DRAM bandwidth and energy
- Cons: use resources less efficiently
  - Long delays: pipeline filling/draining due to inter-layer data dependencies
  - Large SRAM buffers: must store the entire intermediate data
[Figure: layers 1-4 pipelined across the tiles of the chip]
Fine-Grained Data Forwarding
- Forward each subset of data to the next layer as soon as it is ready
  - Reduce pipeline stalls: the next layer starts earlier
  - Reduce buffer capacity: only store the subset currently being forwarded
- Requires matched access patterns between adjacent layers
foreach b in batch Nb
  foreach ifmap i in Ni
    foreach ofmap o in No
      // 2D conv
      O[b][o] += I[b][i] * W[o][i]

The batch loop has no cross-iteration dependencies and is trivially pipelined.

foreach ofmap o in No
  foreach ifmap i in Ni
    // 2D conv

foreach ifmap i in Ni
  foreach ofmap o in No
    // 2D conv

[Figure: timelines for Layers 1-3 showing when fmap subsets 0, 1, 2, ... are produced and consumed under the two loop orders]
Alternate Layer Loop Ordering (ALLO)

[Figure: three-layer pipeline timelines. Unoptimized: each layer must buffer and wait for ALL fmaps of its producer. Optimized: alternating the loop order between adjacent layers lets each layer buffer and wait for only ONE fmap.]

Benefits apply to half of all layers.
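A toy timing model shows why forwarding one fmap at a time shrinks the pipeline-fill delay (assumptions: one time unit per fmap per layer, one engine per layer; the numbers are illustrative, not from the talk):

```python
# A layer may start once `granularity` fmaps of its producer are finished:
# a whole layer's worth without forwarding, a single fmap with ALLO-matched
# fine-grained forwarding.
def finish_time(n_layers, n_fmaps, granularity):
    start = 0
    finish = 0
    for _ in range(n_layers):
        finish = start + n_fmaps   # this layer processes all its fmaps
        start += granularity       # next layer starts after `granularity` fmaps
    return finish

coarse = finish_time(3, 8, 8)  # wait for ALL 8 fmaps: finishes at t = 24
fine = finish_time(3, 8, 1)    # forward ONE fmap at a time: finishes at t = 10
```

With fine-grained forwarding the three layers overlap almost completely, and only the subset in flight needs buffering.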
Layer Pipelining for Complex NN DAGs
- A dataflow tool explores pipeline schedules across multiple layers
- Subject to design rules imposed by data dependency constraints
  - E.g., no multiple predecessor layers on-chip
[Figure: pipeline schedules with on-chip buffer regions R0-R6 for an Inception module, a ResNet module, and an LSTM cell]
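One plausible reading of the "no multiple predecessor layers on-chip" rule can be sketched as a checker (the exact formulation is in the paper; `segment_ok` and the diamond DAG below are illustrative assumptions):

```python
# Check a candidate group of layers to pipeline on-chip together: reject it if
# any layer in the group has more than one of its predecessors also in the group.
def segment_ok(segment, preds):
    on_chip = set(segment)
    for layer in segment:
        on_chip_preds = [p for p in preds.get(layer, []) if p in on_chip]
        if len(on_chip_preds) > 1:
            return False
    return True

# Diamond dependency, as in a residual block: 'add' consumes 'conv' and 'skip'.
preds = {'conv': ['in'], 'skip': ['in'], 'add': ['conv', 'skip']}
```

Here `segment_ok(['conv', 'skip', 'add'], preds)` fails, while a segment containing only one branch of the diamond passes; a scheduling tool would enumerate only the valid segments.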
Evaluation Results
Modeling Methodology
- State-of-the-art NNs
  - CNNs: AlexNet, VGGNet, GoogLeNet, ResNet
  - MLPs and LSTMs: medium and large scales
- Hardware
  - Inference engine: Eyeriss [ISCA'16], 8 × 8 PEs, 32 kB buffer, 500 MHz
  - Off-chip memory: LPDDR3-1600, 4 channels
  - Overall chip: 16 × 16 tiles
    - 16384 PEs + 8 MB SRAM
    - 90 mm² at 28 nm
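The configuration numbers are mutually consistent, which a quick check confirms (16 × 16 tiles of 8 × 8 PEs each, and 8 MB of SRAM split evenly across the tiles):

```python
# Sanity-check the chip configuration arithmetic from this slide.
tiles = 16 * 16             # 256 tiles
pes_per_tile = 8 * 8        # Eyeriss-style 8 x 8 PE array per tile
total_pes = tiles * pes_per_tile
kb_per_tile = 8 * 1024 // tiles  # 8 MB of SRAM spread over the tiles
```

This yields 16384 PEs in total and 32 kB of buffer per tile, matching the per-engine figure above.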
Overall Comparison
- Base tiled vs. monolithic: 3.6x performance, but 7% worse energy
  - Less flexible and less efficient use of on-chip SRAM buffers
- TANGRAM: 2x over base tiled, and outperforms monolithic
[Charts: normalized time and normalized energy of Monolithic, Base Tiled, and TANGRAM on AlexNet, VGGNet, GoogLeNet, ResNet, MLP-M, MLP-L, LSTM-M, and LSTM-L]
Intra- vs. Inter-Layer Optimizations
- Intra-layer: buffer sharing
  - AlexNet: fits large fmaps on-chip
  - MLP-L: enables weight pinning
- Inter-layer: ALLO + complex DAG pipelining
  - Helps AlexNet, GoogLeNet, and LSTM-M
  - Linear NNs benefit less
[Chart: normalized energy of TANGRAM, TANGRAM w/o intra-layer, and TANGRAM w/o inter-layer optimizations on AlexNet, GoogLeNet, MLP-L, and LSTM-M; the w/o-intra bar for MLP-L reaches 4.59]
Summary
- Efficiently scale NN acceleration
  - Coarse-grained parallel dataflow on tiled architectures
  - Optimized tiled architectures outperform monolithic engines
- TANGRAM: dataflow optimizations
  - Intra-layer buffer sharing
  - Inter-layer pipelining with fine-grained data forwarding
  - Pipelining of complex NN DAGs
- Dataflow scheduling tool open-sourced
  - https://github.com/stanford-mast/nn_dataflow
Thank you!