THE NVIDIA DEEP LEARNING ACCELERATOR
©2018 NVIDIA CORPORATION
INTRODUCTION
NVDLA — NVIDIA Deep Learning Accelerator
Developed as part of Xavier, NVIDIA's SoC for autonomous driving applications
Optimized for Convolutional Neural Networks (CNNs) and computer vision
Open-source architecture and RTL release
  Encourages deep learning applications
  Invites contributions from the community
Targeted at edge devices and IoT
Industry-standard formats and a parameterized design
CNN INFERENCE: Convolutional Neural Network
[Figure: example network classifying an image as "cat" through layers conv1 → conv2 → full1 → full2]
Convolutional and fully connected layers
CNN INFERENCE: One Convolutional Layer
[Figure: input activations (C × H × W) convolved with K filters of weights (C × R × S each), followed by post-processing, producing output activations (K × H' × W')]
ARCHITECTURE OVERVIEW: Top-Level Architecture
[Figure: block diagram — a convolution core fed by a convolutional buffer holding input activations and filter weights; a post-processing block; a configuration and control block on the control bus; a memory interface to SDRAM and internal RAM]
ARCHITECTURE OVERVIEW: Memory Considerations
Convolutional buffer size vs. memory bandwidth trade-off
If the conv buffer can hold only 1/N of the total weights, the input activations must be read N times
Example: GoogLeNet layer inception_4a/3x3, 16-bit precision
  Input activations: 1.2 MB
  Filter weights: 360 KB
  With a 128 KB conv buffer, the minimum activation traffic is 3 × 1.2 MB = 3.6 MB
The convolutional buffer must be multi-ported; internal RAM can be single-ported
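The buffer-size vs. bandwidth trade-off above can be sketched numerically. This is an illustrative helper (not part of the NVDLA release); the function name and byte counts are assumptions based on the slide's example.

```python
import math

def activation_traffic(weight_bytes, activation_bytes, buffer_bytes):
    """If the conv buffer holds only 1/N of the weights, activations
    are streamed N times; return (N, total activation bytes read)."""
    n = math.ceil(weight_bytes / buffer_bytes)
    return n, n * activation_bytes

# GoogLeNet inception_4a/3x3 at 16-bit precision (numbers from the slide)
n, traffic = activation_traffic(360 * 1024, int(1.2 * 1024 * 1024), 128 * 1024)
print(n, traffic / 2**20)  # 3 passes, about 3.6 MB of activation reads
```

A larger buffer (≥ 360 KB here) would drop N to 1, reading activations only once.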
CONVOLUTION CORE: Atomic Operation (Each Clock Cycle)
Example: 64 input activations × (4 filters × 64 weights) → 4 partial outputs
256 MACs per clock cycle
[Figure: a 64-deep slice of the input activations (C × H × W) multiplied against matching 64-weight slices of 4 filters (C × R × S each, K filters total)]
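The atomic operation can be sketched in plain Python. This is illustrative only: 64 shared activations times 4 filters of 64 weights each gives the 4 · 64 = 256 multiply-accumulates per cycle stated on the slide.

```python
def atomic_op(activations, filter_weights):
    """One atomic operation (sketch): activations is 64 values;
    filter_weights is 4 lists of 64 weights. Each filter reuses the
    same 64 activations, yielding 4 partial sums (256 MACs total)."""
    assert len(activations) == 64 and len(filter_weights) == 4
    return [sum(a * w for a, w in zip(activations, weights))
            for weights in filter_weights]

acts = [1] * 64                          # toy data
weights = [[k] * 64 for k in range(4)]   # 4 toy filters
print(atomic_op(acts, weights))          # [0, 64, 128, 192]
```

Note how the activations are read once and reused across all four filters; the hardware exploits exactly this reuse.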
CONVOLUTION CORE: Order of Calculations
Calculate one stripe of outputs at a time
Change activations first, then weights
In the slide show:
  C-dimension not shown
  Only one kernel shown
[Figure: animation of a 3×3 kernel (R = S = 3) sweeping the input data (W × H) to produce a stripe of output data, switching C between passes]
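The "activations first, then weights" ordering can be sketched as a loop nest. This is an illustrative model, not the RTL: the weights sit in the outer loops (held constant, per the power argument on the next slide) while activations stream in the inner loops.

```python
def convolve_stripe(inp, kernel, R=3, S=3):
    """Valid 2D convolution sketch; inp is a list of rows.
    Weight loops are outermost so each weight stays fixed while
    the activations stream past it (weight-stationary order)."""
    H, W = len(inp), len(inp[0])
    out = [[0] * (W - S + 1) for _ in range(H - R + 1)]
    for r in range(R):                  # weights outer: held constant
        for s in range(S):
            w = kernel[r][s]
            for y in range(H - R + 1):  # activations inner: streamed
                for x in range(W - S + 1):
                    out[y][x] += w * inp[y + r][x + s]
    return out

print(convolve_stripe([[1] * 4 for _ in range(4)],
                      [[1] * 3 for _ in range(3)]))  # [[9, 9], [9, 9]]
```

Each output element accumulates a full sum before it is written out, matching the full-sums-over-partial-sums preference discussed next.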
CONVOLUTION CORE: Area and Power Considerations
More samples in the C-direction per atomic operation:
  Allows a wider Wallace tree (half adders rather than full adders)
  Wastes more MACs on non-aligned boundaries
Keeping one MAC operand (the weights) constant for a number of cycles saves dynamic power and reduces data transfers
Calculate full sums rather than partial sums: partial sums require higher precision before rounding, so it is better to stream/store inputs than outputs
AREA, PERFORMANCE, AND POWER: Configurations
SMALL CONFIGURATION
  INT8 data path
  1 RAM interface
  No advanced features
LARGE CONFIGURATION
  INT8, INT16, FP16 data paths
  2 RAM interfaces
  Integrated controller
  Weight compression
  …
AREA, PERFORMANCE, POWER: Small Configurations (16 nm, 1 GHz)

INT8 MACs      Conv. Buffer  Area   Memory BW  ResNet50 Perf  Power  Power Eff.
(# instances)  (KB)          (mm²)  (GB/s)     (frames/s)     (mW)   (DL TOPS/W)
2048           512           3.3    20         269            388    5.4
1024           256           1.8    15         153            185    6.3
512            256           1.4    10         93             107    6.8
256            256           1.0    5          46             64     5.6
128            256           0.84   2          20             41     3.8
64             128           0.55   1          7.3            28     2.0

Notes:
  Area is synthesis area plus internal RAMs; does not account for layout inefficiencies
  Power is for the DLA including internal RAMs, excluding SoC and external RAMs
  Calibrated to Xavier silicon: NVIDIA flows, libraries, RAM compilers, …
  DL TOPS = # of convolutional MAC operations × 2
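The DL TOPS metric in the notes is simple arithmetic and can be checked directly. This sketch is illustrative; the function name and the utilization interpretation are assumptions, while the MAC counts, clock, and power figures come from the table.

```python
def dl_tops(num_macs, clock_hz, utilization=1.0):
    """DL TOPS counts each MAC as two operations (multiply + add)."""
    return num_macs * 2 * clock_hz * utilization / 1e12

peak = dl_tops(2048, 1e9)  # largest small config at 1 GHz
print(peak)                # 4.096 peak DL TOPS
# At 388 mW this bounds efficiency at 4.096 / 0.388 ≈ 10.6 DL TOPS/W;
# the measured 5.4 DL TOPS/W on ResNet50 reflects real utilization.
```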
AREA, PERFORMANCE, POWER: Large Configuration (16 nm, 1 GHz)
Area and power do not include Tightly Coupled Memory (TCM)

Configuration:
  INT16/FP16: 512 MACs
  INT8: 1024 MACs
  Conv buffer: 256 KB
  Area: 2.4 mm²
  DRAM BW: 15 GB/s
  TCM R/W BW: 25/25 GB/s

Data Type  Internal RAM Size  ResNet50 Perf  Power  Power Eff.
                              (frames/s)     (mW)   (DL TOPS/W)
INT8       none               165            267    4.8
FP16       none               59             276    1.6
INT8       2M                 230            348    5.1
FP16       2M                 115            475    1.9
SW ARCHITECTURE
Compile time: Caffe model → parser → compiler (with compiler params) → loadable
Run time: Application → User Mode Driver → Kernel Mode Driver → DLA hardware
NVDLA VIRTUAL PLATFORM
SW & RTL prototyping for all configs
Two formats for NVDLA + memory:
  C-model
  FPGA board
Easy access via the cloud: Amazon Web Services (AWS)
[Figure: a QEMU CPU cluster running the OS, application, UMD, and KMD, connected to NVDLA and memory]
OPEN SOURCE SOC PROTOTYPE: NVDLA + SiFive RISC-V
Demo at SiFive booth
NVDLA config: small config, 2048 MACs, 512 KB conv buffer
YOLOv3 object recognition
[Figure: FPGA containing NVDLA and a RISC-V CPU, each with a memory interface to DRAM, plus I/O interfaces]
CHECK IT OUT FOR YOURSELF — OR CONTRIBUTE!
http://nvdla.org/index.html
Documentation and source code are available under a permissive license
Community contributions under a Contributor License Agreement are encouraged