Rangharajan Venkatesan, Yakun Sophia Shao, Brian Zimmer, Jason Clemons, Matthew Fojtik, Nan Jiang, Ben Keller, Alicia Klinefelter, Nathaniel Pinckney, Priyanka Raina, Stephen G. Tell, Yanqing Zhang, William J. Dally, Joel S. Emer, C. Thomas Gray, Stephen W. Keckler & Brucek Khailany
A 0.11 PJ/OP, 0.32-128 TOPS, SCALABLE MULTI-CHIP-MODULE-BASED DEEP NEURAL NETWORK ACCELERATOR DESIGNED WITH A HIGH-PRODUCTIVITY VLSI METHODOLOGY
© NVIDIA 2019
RESEARCH TESTCHIP GOALS
NVIDIA RESEARCH OVERVIEW
Research Teams:
Graphics, Deep Learning, Robotics, Computer Vision, Parallel Architectures, Programming Systems, Circuits, VLSI, Networks
Recent Works:
TESTCHIPS
Develop and Demonstrate Underlying Technologies for Efficient DL Inference
RC12
RC16
RC17
RC13
RC18
RTX, NVSwitch
cuDNN, CNN Image Inpainting
Noise-to-Noise Denoising
Progressive GAN
This Work:
• Scalable architecture for DL inference acceleration
• High-productivity design methodology
VAST WORLD OF AI INFERENCE
Creating a Massive Market Opportunity
GENERAL PURPOSE COMPUTERS EMBEDDED COMPUTERS EMBEDDED DEVICES
TARGET APPLICATIONS
Image Classification with Convolutional Neural Networks
Deep Learning Models: AlexNet, DriveNet, ResNet
Different MCM configurations for image classification
Ref: Krizhevsky et al., NeurIPS 2012; Bojarski et al., CoRR 2016; He et al., CoRR 2015
SCALABLE DEEP LEARNING INFERENCE ACCELERATOR
MULTI-CHIP-MODULE (MCM) ARCHITECTURE
Demonstrate:
Low-effort scaling to high inference performance
Ground Reference Signaling (GRS) as an MCM interconnect
Network-on-Package architecture

Advantages:
Overcome reticle limits
Higher yield
Lower design cost
Mix process technologies
Agility in changing product SKUs

Challenges:
Area and power overheads for inter-chip interfaces
Ref: Zimmer et al., VLSI 2019
HIERARCHICAL COMMUNICATION ARCHITECTURE
Network-on-Package (NoP) and Network-on-Chip (NoC)
NETWORK-ON-PACKAGE (NoP)
6x6 mesh topology connects 36 chips in a package
A single NoP router per chip, with 4 interface ports to the NoC
Configurable routing to avoid bad links/chips
~20 ns per hop, 100 Gbps per link (at max)

NETWORK-ON-CHIP (NoC)
4x5 mesh topology connects 16 PEs, one Global PE, and one RISC-V core
Cut-through routing with multicast support
10 ns per hop, ~70 Gbps per link (at 0.72 V)
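As a quick sanity check on the NoP numbers, a sketch of the hop-count and latency arithmetic for the 6x6 package mesh, assuming dimension-order (XY) routing (the slides only state that routing is configurable, so XY is an illustrative assumption):

```cpp
#include <cstdlib>

// Hop count between two chips on a mesh under dimension-order (XY)
// routing: traverse the X distance first, then the Y distance.
int xy_hops(int x0, int y0, int x1, int y1) {
    return std::abs(x1 - x0) + std::abs(y1 - y0);
}

// End-to-end NoP latency estimate at ~20 ns per hop (from the slide).
double nop_latency_ns(int hops) {
    return hops * 20.0;
}
```

Corner-to-corner on the 6x6 mesh is 5 + 5 = 10 hops, i.e. roughly 200 ns of network traversal before serialization and queueing.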
GROUND-REFERENCED SIGNALING (GRS)
High Speed
11-25 Gbps per pin
High energy efficiency
Low voltage swing (~200mV)
0.82-1.75 pJ/bit
High area efficiency
Single-ended links
4 data bumps + 1 clock bump per GRS link
High Bandwidth, Energy-efficient Inter-chip Communication
GRS Macro
Ref: Poulton et al., JSSC 2019
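The per-link bandwidth quoted for the NoP follows from the GRS pin rate: 4 data bumps at 25 Gbps each yield 100 Gbps per link. A small sketch of that arithmetic, including the link power implied by the pJ/bit range (derived figures, not measured numbers from the slide):

```cpp
// Link bandwidth: data bumps per link times per-pin signaling rate.
double link_bandwidth_gbps(int data_bumps, double gbps_per_pin) {
    return data_bumps * gbps_per_pin;
}

// Link power in mW: Gbit/s times pJ/bit conveniently works out to milliwatts.
double link_power_mw(double gbps, double pj_per_bit) {
    return gbps * pj_per_bit;
}
```

At the quoted 0.82-1.75 pJ/bit, a fully loaded 100 Gbps link dissipates on the order of 80-175 mW.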
SCALABLE DL INFERENCE ACCELERATOR
Tiled Architecture with Distributed Memory
Ref: Zimmer et al., VLSI 2019
[Figure: Package/chip overview and PE microarchitecture — distributed weight buffer, input buffer, eight vector MAC units, accumulation buffer with adders and manager, address generators, post-processing unit (PPU) with truncation, ReLU, pooling, and bias, and a router interface with serdes]
Ref: Sijstermans et al., HotChips 2018
SCALABLE DL INFERENCE ACCELERATOR
CNN Layer Execution
[Figure: CNN layer tensors — input activations (H × W × C), weights (R × S × C × K), output activations (P × Q × K)]
Distribute weights across PEs
Load Input Activation to Global PE
RISC-V configures control registers
Stream input activations to PEs
Store output activations to Global PE
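The steps above execute a standard convolution over the slide's dimension names (H × W × C input, R × S × C × K weights, P × Q × K output). A minimal reference loop nest, assuming stride 1 and no padding (the hardware's actual tiling and loop ordering differ):

```cpp
#include <vector>

// Naive convolution in the slide's notation: input activations H x W x C,
// weights K x C x R x S, output activations P x Q x K with
// P = H - R + 1 and Q = W - S + 1 (stride 1, no padding).
std::vector<float> conv_layer(const std::vector<float>& in,
                              const std::vector<float>& wt,
                              int H, int W, int C, int R, int S, int K) {
    const int P = H - R + 1, Q = W - S + 1;
    std::vector<float> out(P * Q * K, 0.0f);
    for (int k = 0; k < K; ++k)                      // output channels
        for (int p = 0; p < P; ++p)                  // output rows
            for (int q = 0; q < Q; ++q)              // output cols
                for (int c = 0; c < C; ++c)          // input channels
                    for (int r = 0; r < R; ++r)      // filter rows
                        for (int s = 0; s < S; ++s)  // filter cols
                            out[(k * P + p) * Q + q] +=
                                in[((p + r) * W + (q + s)) * C + c] *
                                wt[((k * C + c) * R + r) * S + s];
    return out;
}
```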
SCALING DL INFERENCE ACROSS NOP/NOC
Tiling Convolutional Layer Across Chips and Processing Elements
[Figure: Tiling a convolutional layer — input channels C and output channels K are split across 4 chips into Cchip × Kchip slices, and within each chip across 16 PEs into Cpe × Kpe slices]
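The tiling above splits input channels C and output channels K first into per-chip slices (Cchip × Kchip) and then into per-PE slices (Cpe × Kpe). A bookkeeping sketch of one level of that split, assuming C and K divide evenly (a hypothetical helper, not the chip's actual configuration interface):

```cpp
// Half-open [begin, end) ranges of input channels (c) and output
// channels (k) assigned to one tile.
struct Tile {
    int c_begin, c_end;
    int k_begin, k_end;
};

// Slice C across nc tiles and K across nk tiles; (ci, ki) selects one
// tile. Applying the same function twice gives chip-level (Cchip x Kchip)
// and then PE-level (Cpe x Kpe) slices.
Tile make_tile(int C, int K, int nc, int nk, int ci, int ki) {
    const int c_step = C / nc, k_step = K / nk;
    return {ci * c_step, (ci + 1) * c_step,
            ki * k_step, (ki + 1) * k_step};
}
```

For example, with C = 256 and K = 512 split 2 × 2 across 4 chips, chip (1, 0) owns input channels 128-255 and output channels 0-255.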
FABRICATED MCM-BASED ACCELERATOR
NVResearch Prototype: 36 Chips on Package in TSMC 16nm Technology
Core area: 111.6 mm2
Voltage: 0.52-1.1 V
Frequency: 0.48-1.8 GHz
High speed interconnects using Ground Reference Signaling (GRS)
100 Gbps per link
Efficient Compute tiles
9.5 TOPS/W, 128 TOPS
Low Design Effort
Spec-to-Tapeout in 6 months with
<10 researchers
Ref: Zimmer et al., VLSI 2019
HIGH PRODUCTIVITY DESIGN METHODOLOGY
HIGH-PRODUCTIVITY DESIGN APPROACH
Enables faster time-to-market and more features in each SoC
RAISE HARDWARE DESIGN LEVEL OF ABSTRACTION
Use High-level languages
e.g. C++ instead of Verilog
Use Automation
e.g. High-Level Synthesis (HLS)
Use libraries/generators
MatchLib
AGILE VLSI DESIGN
Small teams, jointly working on architecture, implementation, VLSI
Continuous integration with automated tool flows
Agile project management techniques
24-hour spins from C++-to-layout
OBJECT-ORIENTED HIGH-LEVEL SYNTHESIS
Leverage HLS tools to design with C++ and SystemC models
MatchLib: Modular Approach To Circuits and Hardware Library
“STL/Boost” for Hardware Design
Synthesizable hardware library developed by NVIDIA research
Highly parameterized, high-QoR implementation
Available open-source: https://github.com/NVlabs/matchlib
Latency-Insensitive (LI) Channels
Enable modularity in the design process
Decouple computation & communication architectures
“Push-button” C++-to-gates flow
Ref: Khailany et al., DAC 2018
6 months from spec-to-tapeout with <10 researchers
[Figure: Design flow — architectural model (C++/SystemC) plus MatchLib components and LI channels (SystemC) feed HLS to generate RTL; a SystemC verification testbench and C++ simulation provide functional and performance verification]
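MatchLib's LI channels hide cycle-level timing behind a valid/ready-style handshake, so producer and consumer modules can be composed and re-pipelined without re-verifying each other's timing. The toy plain-C++ class below only illustrates that decoupling; it is not the MatchLib Connections API:

```cpp
#include <cstddef>
#include <deque>
#include <optional>

// Toy latency-insensitive channel: a bounded FIFO where the producer
// pushes only when space is available ("ready") and the consumer pops
// only when data is present ("valid"). Neither side observes the
// other's exact timing.
template <typename T>
class LiChannel {
public:
    explicit LiChannel(std::size_t capacity) : capacity_(capacity) {}

    bool push(const T& v) {      // returns false when back-pressured
        if (fifo_.size() >= capacity_) return false;
        fifo_.push_back(v);
        return true;
    }

    std::optional<T> pop() {     // empty when no data is valid
        if (fifo_.empty()) return std::nullopt;
        T v = fifo_.front();
        fifo_.pop_front();
        return v;
    }

private:
    std::deque<T> fifo_;
    std::size_t capacity_;
};
```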
PROCESSING ELEMENT IMPLEMENTATION
Reuse, Modularity, Hierarchical Design
AGILE VLSI DESIGN TECHNIQUES
Daily "C++ to Layout" Spins
[Flow: C++ → HLS + Verilog gen (3 hrs) → RTL → Synthesis (2 hrs) → Gates → Place & Route (12 hrs) → GDS, with QoR tracked on each spin]

void dut() {
  if (rst.read() == 1) {
    count = 0;
    counter_out.write(count);
  } else if (enable.read() == 1) {
    count = count + 1;
    counter_out.write(count);
  }
}
Agile, incremental approach to design closure during march-to-tapeout phase
Small, abutting partitions for fast place and route iterations
Globally asynchronous, locally synchronous (GALS) design with a pausible adaptive clocking scheme
Fast, error-free clock domain crossings
“Correct by construction” top-level timing closure
RTL bugs, performance, and VLSI constraints converge together
Ref: Fojtik et al., ASYNC 2019
EXPERIMENTAL RESULTS
MEASUREMENT SETUP
Measurements begin after weights and activations are loaded from FPGA DRAM
Weights are loaded to PE memory
Activations are loaded to Global PE
Operating points:
Max. performance: 1.1 V
Min. energy: 0.55 V
RESULTS: WEAK SCALING WITH DRIVENET
[Charts: Performance (1.1 V) — throughput and latency vs. number of chips (1, 4, 32; batch = 1, 4, 32): throughput 2.3K, 8.7K, 63K images/sec; latency 0.43, 0.46, 0.51 ms. Energy efficiency (0.55 V) — core energy 95, 120, 131 µJ/image; GRS energy adds roughly 95 and 207 µJ/image at 4 and 32 chips]

Scaling to 32 chips achieves a 27X performance improvement over 1 chip.
Core energy consumption stays nearly proportional under weak scaling.
GRS energy can be reduced with a sleep mode.

Ref: Bojarski et al., CoRR 2016
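The 27X figure is just the throughput ratio: 63K images/sec on 32 chips versus 2.3K on one chip. A quick check, with weak-scaling efficiency as a derived metric (not stated on the slide):

```cpp
// Speedup of the multi-chip configuration over a single chip.
double speedup(double throughput_many, double throughput_one) {
    return throughput_many / throughput_one;
}

// Fraction of ideal linear scaling achieved.
double scaling_efficiency(double sp, int n_chips) {
    return sp / n_chips;
}
```

63,000 / 2,300 ≈ 27.4X on 32 chips, i.e. roughly 86% of ideal linear scaling.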
RESULTS: STRONG SCALING WITH RESNET-50
[Charts: Performance (1.1 V) — throughput 216, 632, 2,548 images/sec and latency 4.64, 1.58, 0.39 ms for 1, 4, 32 chips. Energy efficiency (0.55 V) — core energy 2.9, 3.6, 6.3 mJ/image; GRS energy adds 1.8 and 6.1 mJ/image at 4 and 32 chips]

Scaling to 32 chips achieves a 12X performance improvement over 1 chip at batch = 1.
Communication and synchronization overheads limit the speed-up.
High energy efficiency is maintained at different scales.
Communication adds only a small overhead as the number of chips scales.

Ref: He et al., CoRR 2015
SUMMARY
NVResearch Test Chip
Scalable Inference Accelerator
Uses MCM to address different markets with one architecture
0.11 pJ/op (8b) and 128 TOPS (111.6 mm2 core area) across 36 chips connected via Ground-Referenced Signaling in a single package
Achieved 2.5K images/sec with 0.4 ms latency on ResNet-50 at batch = 1
High Productivity Design Methodology
Enables faster time-to-market and more features in each SoC
10X reduction in ASIC design and verification efforts
ACKNOWLEDGMENTS
Research sponsored by DARPA under the CRAFT program (PM: Linton Salmon).
NVIDIA Collaborators: Frans Sijstermans, Dan Smith, Don Templeton, Guy Peled, Jim Dobbins, Ben Boudaoud, Randall Laperriere, Borhan Moghadam, Sunil Sudhakaran, Zuhair Bokharey, Sankara Rajapandian, James Chen, John Hu, Vighnesh Iyer, Angshuman Parashar for architecture, package, PCB, signal integrity, fabrication, and emulation support.
Catapult HLS team from Mentor, A Siemens Business: Bryan Bowyer, Stuart Clubb, Moises Garcia, and Khalid Islam for discussions and support.
Collaborative Effort Across Architecture and Design Methodology
This research was, in part, funded by the U.S. Government, under the DARPA CRAFT program. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.
Distribution Statement "A" (Approved for Public Release, Distribution Unlimited).