Rangharajan Venkatesan, Yakun Sophia Shao, Brian Zimmer, Jason Clemons, Matthew Fojtik, Nan Jiang, Ben Keller, Alicia Klinefelter, Nathaniel Pinckney, Priyanka Raina, Stephen G. Tell, Yanqing Zhang, William J. Dally, Joel S. Emer, C. Thomas Gray, Stephen W. Keckler & Brucek Khailany
A 0.11 PJ/OP, 0.32-128 TOPS, SCALABLE MULTI-CHIP-MODULE-BASED DEEP NEURAL NETWORK ACCELERATOR DESIGNED WITH A HIGH-PRODUCTIVITY VLSI METHODOLOGY
© NVIDIA 2019
RESEARCH TESTCHIP GOALS
NVIDIA RESEARCH OVERVIEW
Research Teams:
Graphics, Deep Learning, Robotics, Computer Vision, Parallel Architectures, Programming Systems, Circuits, VLSI, Networks
Recent Works:
TESTCHIPS
Develop and Demonstrate Underlying Technologies for Efficient DL Inference
RC12
RC16
RC17
RC13
RC18
RTX, NVSwitch
cuDNN, CNN Image Inpainting
Noise-to-Noise Denoising
Progressive GAN
This Work:
• Scalable architecture for DL inference acceleration
• High-productivity design methodology
VAST WORLD OF AI INFERENCE
Creating a Massive Market Opportunity
GENERAL PURPOSE COMPUTERS EMBEDDED COMPUTERS EMBEDDED DEVICES
TARGET APPLICATIONS
Image Classification with Convolutional Neural Networks
Deep Learning Models: AlexNet, DriveNet, ResNet
Different MCM configurations for image classification
Ref: Krizhevsky et al., NeurIPS 2012; Bojarski et al., CoRR 2016; He et al., CoRR 2015
SCALABLE DEEP LEARNING INFERENCE ACCELERATOR
MULTI-CHIP-MODULE (MCM) ARCHITECTURE
Demonstrate:
Low-effort scaling to high inference performance
Ground Reference Signaling (GRS) as an MCM interconnect
Network-on-Package architecture

Advantages:
Overcome reticle limits
Higher yield
Lower design cost
Mix process technologies
Agility in changing product SKUs

Challenges:
Area and power overheads for inter-chip interfaces
Ref: Zimmer et al., VLSI 2019
HIERARCHICAL COMMUNICATION ARCHITECTURE
Network-on-Package (NoP) and Network-on-Chip (NoC)
NETWORK-ON-PACKAGE (NoP)
6x6 mesh topology connects 36 chips in a package
A single NoP router per chip, with 4 interface ports to the NoC
Configurable routing to avoid bad links/chips
~20 ns per hop, 100 Gbps per link (at max)

NETWORK-ON-CHIP (NoC)
4x5 mesh topology connects 16 PEs, one Global PE, and one RISC-V core
Cut-through routing with multicast support
10 ns per hop, ~70 Gbps per link (at 0.72 V)
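As a quick sanity check on the NoP numbers, a sketch of the hop-count and latency arithmetic for the 6x6 package mesh, assuming dimension-order (XY) routing (the slides only state that routing is configurable, so XY is an illustrative assumption):

```cpp
#include <cstdlib>

// Hop count between two chips on a mesh under dimension-order (XY)
// routing: traverse the X distance first, then the Y distance.
int xy_hops(int x0, int y0, int x1, int y1) {
    return std::abs(x1 - x0) + std::abs(y1 - y0);
}

// End-to-end NoP latency estimate at ~20 ns per hop (from the slide).
double nop_latency_ns(int hops) {
    return hops * 20.0;
}
```

Corner-to-corner on the 6x6 mesh is 5 + 5 = 10 hops, i.e. roughly 200 ns of network traversal before serialization and queueing.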
GROUND-REFERENCED SIGNALING (GRS)
High Speed
11-25 Gbps per pin
High energy efficiency
Low voltage swing (~200mV)
0.82-1.75 pJ/bit
High area efficiency
Single-ended links
4 data bumps + 1 clock bump per GRS link
High Bandwidth, Energy-efficient Inter-chip Communication
GRS Macro
Ref: Poulton et al., JSSC 2019
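The per-link bandwidth quoted for the NoP follows from the GRS pin rate: 4 data bumps at 25 Gbps each yield 100 Gbps per link. A small sketch of that arithmetic, including the link power implied by the pJ/bit range (derived figures, not measured numbers from the slide):

```cpp
// Link bandwidth: data bumps per link times per-pin signaling rate.
double link_bandwidth_gbps(int data_bumps, double gbps_per_pin) {
    return data_bumps * gbps_per_pin;
}

// Link power in mW: Gbit/s times pJ/bit conveniently works out to milliwatts.
double link_power_mw(double gbps, double pj_per_bit) {
    return gbps * pj_per_bit;
}
```

At the quoted 0.82-1.75 pJ/bit, a fully loaded 100 Gbps link dissipates on the order of 80-175 mW.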
SCALABLE DL INFERENCE ACCELERATOR
Tiled Architecture with Distributed Memory
Ref: Zimmer et al., VLSI 2019
[Figure: Package/chip overview and PE microarchitecture — distributed weight buffer, input buffer, eight vector MAC units, accumulation buffer with adders and manager, address generators, post-processing unit (PPU) with truncation, ReLU, pooling, and bias, and a router interface with serdes]
Ref: Sijstermans et al., HotChips 2018
SCALABLE DL INFERENCE ACCELERATOR
CNN Layer Execution
[Figure: CNN layer tensors — input activations (H × W × C), weights (R × S × C × K), output activations (P × Q × K)]
Distribute weights across PEs
Load Input Activation to Global PE
RISC-V configures control registers
Stream input activations to PEs
Store output activations to Global PE
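The steps above execute a standard convolution over the slide's dimension names (H × W × C input, R × S × C × K weights, P × Q × K output). A minimal reference loop nest, assuming stride 1 and no padding (the hardware's actual tiling and loop ordering differ):

```cpp
#include <vector>

// Naive convolution in the slide's notation: input activations H x W x C,
// weights K x C x R x S, output activations P x Q x K with
// P = H - R + 1 and Q = W - S + 1 (stride 1, no padding).
std::vector<float> conv_layer(const std::vector<float>& in,
                              const std::vector<float>& wt,
                              int H, int W, int C, int R, int S, int K) {
    const int P = H - R + 1, Q = W - S + 1;
    std::vector<float> out(P * Q * K, 0.0f);
    for (int k = 0; k < K; ++k)                      // output channels
        for (int p = 0; p < P; ++p)                  // output rows
            for (int q = 0; q < Q; ++q)              // output cols
                for (int c = 0; c < C; ++c)          // input channels
                    for (int r = 0; r < R; ++r)      // filter rows
                        for (int s = 0; s < S; ++s)  // filter cols
                            out[(k * P + p) * Q + q] +=
                                in[((p + r) * W + (q + s)) * C + c] *
                                wt[((k * C + c) * R + r) * S + s];
    return out;
}
```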
SCALING DL INFERENCE ACROSS NOP/NOC
Tiling Convolutional Layer Across Chips and Processing Elements
[Figure: Tiling a convolutional layer — input channels C and output channels K are split across 4 chips into Cchip × Kchip slices, and within each chip across 16 PEs into Cpe × Kpe slices]
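The tiling above splits input channels C and output channels K first into per-chip slices (Cchip × Kchip) and then into per-PE slices (Cpe × Kpe). A bookkeeping sketch of one level of that split, assuming C and K divide evenly (a hypothetical helper, not the chip's actual configuration interface):

```cpp
// Half-open [begin, end) ranges of input channels (c) and output
// channels (k) assigned to one tile.
struct Tile {
    int c_begin, c_end;
    int k_begin, k_end;
};

// Slice C across nc tiles and K across nk tiles; (ci, ki) selects one
// tile. Applying the same function twice gives chip-level (Cchip x Kchip)
// and then PE-level (Cpe x Kpe) slices.
Tile make_tile(int C, int K, int nc, int nk, int ci, int ki) {
    const int c_step = C / nc, k_step = K / nk;
    return {ci * c_step, (ci + 1) * c_step,
            ki * k_step, (ki + 1) * k_step};
}
```

For example, with C = 256 and K = 512 split 2 × 2 across 4 chips, chip (1, 0) owns input channels 128-255 and output channels 0-255.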
FABRICATED MCM-BASED ACCELERATOR
NVResearch Prototype: 36 Chips on Package in TSMC 16nm Technology
Core area: 111.6 mm2
Voltage: 0.52-1.1 V
Frequency: 0.48-1.8 GHz
High speed interconnects using Ground Reference Signaling (GRS)
100 Gbps per link
Efficient Compute tiles
9.5 TOPS/W, 128 TOPS
Low Design Effort
Spec-to-Tapeout in 6 months with
<10 researchers
Ref: Zimmer et al., VLSI 2019
HIGH PRODUCTIVITY DESIGN METHODOLOGY
HIGH-PRODUCTIVITY DESIGN APPROACH
Enables faster time-to-market and more features in each SoC
RAISE HARDWARE DESIGN LEVEL OF ABSTRACTION
Use High-level languages
e.g. C++ instead of Verilog
Use Automation
e.g. High-Level Synthesis (HLS)
Use libraries/generators
MatchLib
AGILE VLSI DESIGN
Small teams, jointly working on architecture, implementation, VLSI
Continuous integration with automated tool flows
Agile project management techniques
24-hour spins from C++-to-layout
OBJECT-ORIENTED HIGH-LEVEL SYNTHESIS
Leverage HLS tools to design with C++ and SystemC models
MatchLib: Modular Approach To Circuits and Hardware Library
“STL/Boost” for Hardware Design
Synthesizable hardware library developed by NVIDIA research
Highly parameterized, high-QoR implementation
Available open-source: https://github.com/NVlabs/matchlib
Latency-Insensitive (LI) Channels
Enable modularity in the design process
Decouple computation & communication architectures
“Push-button” C++-to-gates flow
Ref: Khailany et al., DAC 2018
6 months from spec-to-tapeout with <10 researchers
[Figure: Design flow — architectural model (C++/SystemC) plus MatchLib components and LI channels (SystemC) feed HLS to generate RTL; a SystemC verification testbench and C++ simulation provide functional and performance verification]
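MatchLib's LI channels hide cycle-level timing behind a valid/ready-style handshake, so producer and consumer modules can be composed and re-pipelined without re-verifying each other's timing. The toy plain-C++ class below only illustrates that decoupling; it is not the MatchLib Connections API:

```cpp
#include <cstddef>
#include <deque>
#include <optional>

// Toy latency-insensitive channel: a bounded FIFO where the producer
// pushes only when space is available ("ready") and the consumer pops
// only when data is present ("valid"). Neither side observes the
// other's exact timing.
template <typename T>
class LiChannel {
public:
    explicit LiChannel(std::size_t capacity) : capacity_(capacity) {}

    bool push(const T& v) {      // returns false when back-pressured
        if (fifo_.size() >= capacity_) return false;
        fifo_.push_back(v);
        return true;
    }

    std::optional<T> pop() {     // empty when no data is valid
        if (fifo_.empty()) return std::nullopt;
        T v = fifo_.front();
        fifo_.pop_front();
        return v;
    }

private:
    std::deque<T> fifo_;
    std::size_t capacity_;
};
```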
PROCESSING ELEMENT IMPLEMENTATION
Reuse, Modularity, Hierarchical Design
AGILE VLSI DESIGN TECHNIQUES
Daily "C++ to Layout" Spins
[Flow: C++ → HLS + Verilog gen (3 hrs) → RTL → Synthesis (2 hrs) → Gates → Place & Route (12 hrs) → GDS, with QoR tracked on each spin]

void dut() {
  if (rst.read() == 1) {
    count = 0;
    counter_out.write(count);
  } else if (enable.read() == 1) {
    count = count + 1;
    counter_out.write(count);
  }
}
Agile, incremental approach to design closure during march-to-tapeout phase
Small, abutting partitions for fast place and route iterations
Globally asynchronous, locally synchronous (GALS) design with a pausible adaptive clocking scheme
Fast, error-free clock domain crossings
“Correct by construction” top-level timing closure
RTL bugs, performance, and VLSI constraints converge together
Ref: Fojtik et al., ASYNC 2019
EXPERIMENTAL RESULTS
MEASUREMENT SETUP
Measurements begin after weights and activations are loaded from FPGA DRAM
Weights are loaded to PE memory
Activations are loaded to Global PE
Operating points:
Max. performance: 1.1 V
Min. energy: 0.55 V
RESULTS: WEAK SCALING WITH DRIVENET
[Charts: Performance (1.1 V) — throughput and latency vs. number of chips (1, 4, 32; batch = 1, 4, 32): throughput 2.3K, 8.7K, 63K images/sec; latency 0.43, 0.46, 0.51 ms. Energy efficiency (0.55 V) — core energy 95, 120, 131 µJ/image; GRS energy adds roughly 95 and 207 µJ/image at 4 and 32 chips]

Scaling to 32 chips achieves a 27X performance improvement over 1 chip.
Core energy consumption stays nearly proportional under weak scaling.
GRS energy can be reduced with a sleep mode.

Ref: Bojarski et al., CoRR 2016
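The 27X figure is just the throughput ratio: 63K images/sec on 32 chips versus 2.3K on one chip. A quick check, with weak-scaling efficiency as a derived metric (not stated on the slide):

```cpp
// Speedup of the multi-chip configuration over a single chip.
double speedup(double throughput_many, double throughput_one) {
    return throughput_many / throughput_one;
}

// Fraction of ideal linear scaling achieved.
double scaling_efficiency(double sp, int n_chips) {
    return sp / n_chips;
}
```

63,000 / 2,300 ≈ 27.4X on 32 chips, i.e. roughly 86% of ideal linear scaling.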
RESULTS: STRONG SCALING WITH RESNET-50
[Charts: Performance (1.1 V) — throughput 216, 632, 2,548 images/sec and latency 4.64, 1.58, 0.39 ms for 1, 4, 32 chips. Energy efficiency (0.55 V) — core energy 2.9, 3.6, 6.3 mJ/image; GRS energy adds 1.8 and 6.1 mJ/image at 4 and 32 chips]

Scaling to 32 chips achieves a 12X performance improvement over 1 chip at batch = 1.
Communication and synchronization overheads limit the speed-up.
High energy efficiency is maintained at different scales.
Communication adds only a small overhead as the number of chips scales.

Ref: He et al., CoRR 2015
SUMMARY
NVResearch Test Chip
Scalable Inference Accelerator
Uses MCM to address different markets with one architecture
0.11 pJ/op (8b) and 128 TOPS (111.6 mm2 core area) across 36 chips connected via Ground-Referenced Signaling in a single package
Achieved 2.5K images/sec with 0.4 ms latency on ResNet-50 at batch = 1
High Productivity Design Methodology
Enables faster time-to-market and more features in each SoC
10X reduction in ASIC design and verification efforts
ACKNOWLEDGMENTS
Research sponsored by DARPA under the CRAFT program (PM: Linton Salmon).
NVIDIA Collaborators: Frans Sijstermans, Dan Smith, Don Templeton, Guy Peled, Jim Dobbins, Ben Boudaoud, Randall Laperriere, Borhan Moghadam, Sunil Sudhakaran, Zuhair Bokharey, Sankara Rajapandian, James Chen, John Hu, Vighnesh Iyer, Angshuman Parashar for architecture, package, PCB, signal integrity, fabrication, and emulation support.
Catapult HLS team from Mentor, A Siemens Business: Bryan Bowyer, Stuart Clubb, Moises Garcia, and Khalid Islam for discussions and support.
Collaborative Effort Across Architecture and Design Methodology
This research was, in part, funded by the U.S. Government, under the DARPA CRAFT program. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.
Distribution Statement "A" (Approved for Public Release, Distribution Unlimited).