High-Performance Computing with NVIDIA Tesla GPUs

Timothy Lanfear, NVIDIA

WHY GPU COMPUTING?

Science is Desperate for Throughput

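[Chart: compute requirement in gigaflops (log scale, with 1,000,000 gigaflops = 1 Petaflop and 1,000,000,000 gigaflops = 1 Exaflop) against year (1982, 1997, 2003, 2006, 2010, 2012) for molecular simulations of growing size]

BPTI: 3K atoms
Estrogen Receptor: 36K atoms
F1-ATPase: 327K atoms
Ribosome: 2.7M atoms
Chromatophore: 50M atoms
Bacteria: 100s of chromatophores

Ran for 8 months to simulate 2 nanoseconds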

Power Crisis in Supercomputing

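[Chart: supercomputer power draw at each performance milestone, with the scale labelled in household power equivalents up to a neighborhood]

Years: 1982, 1996, 2008, 2020
Milestones: Gigaflop, Teraflop, Petaflop, Exaflop
Power: 60,000 Watts; 850,000 Watts; 7,000,000 Watts; 25,000,000 Watts
Systems named: Jaguar, Los Alamos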

“Oak Ridge National Lab (ORNL) has already announced it will be using Fermi technology in an upcoming super that is ‘expected to be 10-times more powerful than today’s fastest supercomputer.’

Since ORNL’s Jaguar supercomputer, for all intents and purposes, holds that title, and is in the process of being upgraded to 2.3 Petaflops …

… we can surmise that the upcoming Fermi-equipped super is going to be in the 20 Petaflops range.”

September 30 2009

What is GPU Computing?

Computing with CPU + GPU
Heterogeneous Computing

[Diagram: x86 CPU and GPU connected by the PCIe bus]

Low Latency or High Throughput?

CPU
  Optimised for low-latency access to cached data sets
  Control logic for out-of-order and speculative execution

GPU
  Optimised for data-parallel, throughput computation
  Architecture tolerant of memory latency
  More transistors dedicated to computation

[Diagram labels: ALU, Control]

Why Didn’t GPU Computing Take Off Sooner?

GPU Architecture
  Gaming oriented, process pixel for display
  Single threaded operations
  No shared memory

Development Tools
  Graphics oriented (OpenGL, GLSL)
  University research (Brook)
  Assembly language

Deployment
  Gaming solutions with limited lifetime
  Expensive OpenGL professional graphics boards
  No HPC compatible products

NVIDIA Invested in GPU Computing in 2004

Strategic move for the company
  Expand GPU architecture beyond pixel processing
  Future platforms will be hybrid, multi/many cores based

Hired key industry experts
  x86 architecture
  x86 compiler
  HPC hardware specialist

Create a GPU based Compute Ecosystem by 2008

NVIDIA GPU Computing Ecosystem

[Diagram labels: GPU Architecture; CUDA SDK & Tools; NVIDIA Hardware Solutions; Deployment; Hardware Architecture; Customer Requirements; Customer Application; TPP / OEM; CUDA Development Specialist; CUDA Training Company; Hardware Architect]

NVIDIA GPU Product Families

Tesla™: High-Performance Computing
Quadro®: Design & Creation
GeForce®: Entertainment

Many-Core High Performance Computing

NVIDIA’s 10-series GPU has 240 cores

Each core has a
  Floating point / integer unit
  Logic unit
  Move, compare unit
  Branch unit

Cores managed by thread manager
  Thread manager can spawn and manage 30,000+ threads
  Zero overhead thread switching

1.4 billion transistors

1 Teraflop of processing power

240 processing cores

NVIDIA 10-Series GPU

NVIDIA’s 2nd Generation CUDA Processor
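
The core and thread counts quoted for the 10-series GPU can be cross-checked with a little arithmetic. The sketch below assumes the GT200 organisation of 30 streaming multiprocessors with 8 cores each and up to 1,024 resident threads per multiprocessor. Those per-multiprocessor numbers are not stated on the slide, and the constant names are illustrative only.

```python
# Assumed organisation of the 10-series (GT200) GPU; these three
# constants are NOT given on the slide and are labelled assumptions.
NUM_SMS = 30                        # streaming multiprocessors per GPU
CORES_PER_SM = 8                    # scalar cores per multiprocessor
MAX_RESIDENT_THREADS_PER_SM = 1024  # threads the hardware keeps in flight

# Total cores and the number of threads the thread manager can hold.
total_cores = NUM_SMS * CORES_PER_SM
max_resident_threads = NUM_SMS * MAX_RESIDENT_THREADS_PER_SM

print("cores:", total_cores)
print("resident threads:", max_resident_threads)
```

Under these assumptions the totals agree with the slide's "240 cores" and "30,000+ threads".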

Tesla GPU Computing Products

SuperMicro 1U GPU SuperServer: 2 Tesla GPUs; 1.87 Teraflops single precision; 156 Gigaflops double precision; 8 GB memory (4 GB / GPU)
Tesla S1070 1U System: 4 Tesla GPUs; 4.14 Teraflops single precision; 346 Gigaflops double precision; 16 GB memory (4 GB / GPU)
Tesla C1060 Computing Board: 1 Tesla GPU; 933 Gigaflops single precision; 78 Gigaflops double precision; 4 GB memory
Tesla Personal Supercomputer: 4 Tesla GPUs; 3.7 Teraflops single precision; 312 Gigaflops double precision; 16 GB memory (4 GB / GPU)
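
The peak figures in this product comparison follow from core count and clock rate. The sketch below is a back-of-envelope check, not vendor code. It assumes that each T10 core can issue 3 single-precision flops per clock (a multiply-add plus a multiply) and that each GPU has 30 double-precision units doing one fused multiply-add (2 flops) per clock. Neither assumption appears on the slides. The clock rates (1.296 GHz, and 1.44 GHz for the S1070) are taken from the specification slides later in the deck, and the function name `peak_gflops` is hypothetical.

```python
CORES_PER_GPU = 240            # from the slides
DP_UNITS_PER_GPU = 30          # assumption: one double-precision unit per multiprocessor
SP_FLOPS_PER_CORE_CLOCK = 3    # assumption: multiply-add + multiply per clock
DP_FLOPS_PER_UNIT_CLOCK = 2    # assumption: one fused multiply-add per clock

def peak_gflops(num_gpus, clock_ghz):
    """Return (single, double) precision peak in Gflops for num_gpus T10 GPUs."""
    sp = num_gpus * CORES_PER_GPU * SP_FLOPS_PER_CORE_CLOCK * clock_ghz
    dp = num_gpus * DP_UNITS_PER_GPU * DP_FLOPS_PER_UNIT_CLOCK * clock_ghz
    return sp, dp

# (number of GPUs, core clock in GHz) for each product
products = {
    "SuperMicro 1U GPU SuperServer": (2, 1.296),
    "Tesla S1070 1U System": (4, 1.44),
    "Tesla C1060 Computing Board": (1, 1.296),
    "Tesla Personal Supercomputer": (4, 1.296),
}

for name, (gpus, clock) in products.items():
    sp, dp = peak_gflops(gpus, clock)
    print(f"{name}: {sp:.0f} Gflops SP, {dp:.0f} Gflops DP")
```

The results land within rounding of the slide figures. The slide's 312 Gigaflops for the Personal Supercomputer appears to be 4 × 78, that is, the per-GPU figure rounded before multiplying.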

Tesla C1060 Computing Processor

Processor: 1 × Tesla T10
Number of cores: 240
Core clock: 1.296 GHz
Floating-point performance: 933 Gflops single precision, 78 Gflops double precision
On-board memory: 4.0 GB
Memory bandwidth: 102 GB/sec peak
Memory I/O: 512-bit, 800 MHz GDDR3
Form factor: Full ATX, 4.736″ × 10.5″, dual slot wide
System I/O: PCIe ×16 Gen2
Typical power: 160 W

Tesla M1060 Embedded Module

OEM-only product
Available as integrated product in OEM systems

Tesla Personal Supercomputer

Supercomputing Performance
  Massively parallel CUDA Architecture
  960 cores, 4 Teraflops
  250× the performance of a desktop

Personal
  One researcher, one supercomputer
  Plugs into standard power strip

Accessible
  Program in C for Windows, Linux
  Available now worldwide under $10,000

Tesla S1070 1U System

Processors: 4 × Tesla T10
Number of cores: 960
Core clock: 1.44 GHz
Performance: 4 Teraflops
Total system memory: 16.0 GB (4.0 GB per T10)
Memory bandwidth: 408 GB/sec peak (102 GB/sec per T10)
Memory I/O: 2048-bit, 800 MHz GDDR3 (512-bit per T10)
Form factor: 1U (EIA 19″ rack)
System I/O: 2 PCIe ×16 Gen2
Typical power: 700 W
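
The memory bandwidth figures on the Tesla specification slides can be reproduced from the bus width and memory clock. This is a minimal sketch, assuming GDDR3 moves two transfers per clock (double data rate). The function name `peak_bandwidth_gb_s` and its parameters are illustrative, not part of any NVIDIA tool.

```python
def peak_bandwidth_gb_s(bus_width_bits, memory_clock_mhz, transfers_per_clock=2):
    """Peak memory bandwidth in GB/s (1 GB = 1e9 bytes).

    transfers_per_clock=2 models double-data-rate memory such as GDDR3
    (an assumption, not stated on the slides).
    """
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = memory_clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9

per_t10 = peak_bandwidth_gb_s(512, 800)    # one T10: 512-bit, 800 MHz
s1070 = peak_bandwidth_gb_s(2048, 800)     # four T10s: 2048-bit aggregate

print(f"per T10: {per_t10:.1f} GB/s")
print(f"S1070:   {s1070:.1f} GB/s")
```

This gives 102.4 GB/s per T10, which the slides round to 102 GB/sec. The quoted 408 GB/sec system figure is four times the rounded per-GPU value.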

SuperMicro GPU 1U SuperServer

Two M1060 GPUs in a 1U

Dual Nehalem-EP Xeon CPUs

Up to 96 GB DDR3 ECC

Onboard Infiniband (QDR)

3× hot-swap 3.5″ SATA HDD

1200 W power supply

M1060 GPUs

Tesla Cluster Configurations

Modular 2U compute node
  Tesla S1070 + Host Server
  4 Teraflops

Integrated 1U compute node
  GPU SuperServer
  2 Teraflops

CUDA Parallel Computing Architecture

[Diagram: layered stack]
GPU Computing Applications
CUDA Fortran, CUDA C, Java and Python, OpenCL™, DirectCompute
NVIDIA GPU with the CUDA Parallel Computing Architecture

OpenCL is trademark of Apple Inc. used under license to the Khronos Group Inc.

NVIDIA CUDA C and OpenCL

CUDA C: Entry point for developers who prefer high-level C
OpenCL: Entry point for developers who want low-level API

Shared back-end compiler and optimization technology

[Diagram: software and hardware stack]
Application Software (written in C)
CUDA Libraries: cuFFT, cuBLAS, cuDPP
CUDA Compiler: C, Fortran
CUDA Tools: Debugger, Profiler
Hardware: CPU (4 cores), PCI-E Switch, 1U, GPU (240 cores)

NVIDIA Nexus

Register for the Beta here at GTC!

The first development environment for massively parallel applications.

Beta available October 2009
Releasing in Q1 2010

Hardware GPU Source Debugging

Platform-wide Analysis

Complete Visual Studio integration

http://developer.nvidia.com/object/nexus.html

Graphics Inspector

Platform Trace

Parallel Source Debugging

CUDA Zone: www.nvidia.com/CUDA
Resources for CUDA developers

CUDA Toolkit: Compiler, Libraries
CUDA SDK: Code samples
CUDA Profiler
Forums

Wide Developer Acceptance and Success

Interactive visualization of volumetric white matter connectivity

Ion placement for molecular dynamics simulation

Transcoding HD video stream to

Simulation in Matlab using .mex file CUDA function

Astrophysics N-body simulation

Financial simulation of LIBOR model with swaptions

GLAME@lab: An M-script API for linear algebra operations on GPU

Ultrasound medical imaging for cancer diagnostics

Highly optimized object oriented molecular dynamics

Cmatch exact string matching to find similar proteins and gene sequences

CUDA Co-Processing Ecosystem

Applications: Oil & Gas, Finance, Medical, Biophysics, Numerics, Imaging, DSP, EDA

Libraries: FFT, BLAS, LAPACK, Image processing, Video processing, Signal processing, Vision

Languages: C, C++, DirectX, Fortran, Java, OpenCL, Python

Compilers: PGI Fortran, CAPS HMPP, MCUDA, MPI, NOAA Fortran2C, OpenMP

Over 200 Universities Teaching CUDA: UIUC, MIT, Harvard, Berkeley, Cambridge, Oxford, IIT Delhi, Tsinghua, Dortmund, ETH Zurich, Moscow, NTU, …

Consultants

OEMs

GPU Tech

What We Did in the Past Three Years

2006
  G80, first GPU with built-in compute features, 128 core multi-threaded, scalable architecture
  CUDA SDK Beta

2007
  Tesla HPC product line
  CUDA SDK 1.0, 1.1

2008
  GT200, second GPU generation, 240 core, 64-bit
  Tesla HPC second generation
  CUDA SDK 2.0

2009 …

NEXT-GENERATION GPU ARCHITECTURE — ‘FERMI’

Introducing the ‘Fermi’ Architecture
The Soul of a Supercomputer in the body of a GPU

3 billion transistors

Over 2× the cores (512 total)

8× the peak DP performance

L1 and L2 caches

~2× memory bandwidth (GDDR5)

Up to 1 Terabyte of GPU memory

Concurrent kernels

Hardware support for C++

Design Goal of Fermi

[Chart labels: Instruction Parallel (Many Decisions); Data Parallel (Large Data Sets)]

Expand performance sweet spot of the GPU

Bring more users, more applications to the GPU

Streaming Multiprocessor Architecture

[Diagram: Instruction Cache; two Scheduler and Dispatch pairs; Register File; Load/Store Units x 16; Special Func Units x 4; Interconnect Network; 64K Configurable Cache/Shared Mem; Uniform Cache]

32 CUDA cores per SM (512 total)

8× peak double precision floating point performance

50% of peak single precision

Dual Thread Scheduler

64 KB of RAM for shared memory and L1 cache (configurable)

CUDA Core Architecture

[Diagram: the streaming multiprocessor block diagram with one CUDA core expanded: Dispatch Port, Operand Collector, FP Unit, INT Unit, Result Queue]

New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs

Fused multiply-add (FMA) instruction for both single and double precision

Newly designed integer ALU optimized for 64-bit and extended precision operations
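
The practical difference made by a fused multiply-add is that a × b + c is rounded once instead of twice. The sketch below illustrates this in double precision on the CPU. It is a software emulation built on exact rational arithmetic, not Fermi's hardware instruction, and the helper names `fused_multiply_add` and `multiply_then_add` are made up for this example.

```python
from fractions import Fraction

def fused_multiply_add(a, b, c):
    """Emulate FMA: form a*b + c exactly, then round once to a double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # a single, correctly rounded conversion

def multiply_then_add(a, b, c):
    """Separate multiply and add: the product is rounded before the add."""
    return a * b + c

# Inputs chosen so the exact product, 1 - 2**-60, is not representable
# as a double and rounds to 1.0 when the multiply is done on its own.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print("multiply, then add:", multiply_then_add(a, b, c))
print("fused multiply-add:", fused_multiply_add(a, b, c))
```

With separate operations the small term is lost and the result is zero. The fused form keeps it and returns -2^-60.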

Cached Memory Hierarchy

First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory

L1 Cache per SM (32 cores)
  Improves bandwidth and reduces latency

Unified L2 Cache (768 KB)
  Fast, coherent data sharing across all cores in the GPU

Parallel DataCache™ Memory Hierarchy

Larger, Faster Memory Interface

GDDR5 memory interface
  2× speed of GDDR3

Up to 1 Terabyte of memory attached to GPU

Operate on large data sets

Error Correcting Code

ECC protection for DRAM
  ECC supported for GDDR5 memory

All major internal memories are ECC protected
  Register file, L1 cache, L2 cache

GigaThread™ Hardware Thread Scheduler

Hierarchically manages thousands of simultaneously active threads

10× faster application context switching

Concurrent kernel execution

GigaThread Hardware Thread Scheduler: Concurrent Kernel Execution + Faster Context Switch

[Diagram: two timelines. Under Serial Kernel Execution, Kernels 1 to 5 run one after another. Under Parallel Kernel Execution, several kernels run side by side.]

GigaThread Streaming Data Transfer Engine

Dual DMA engines
  Simultaneous CPU→GPU and GPU→CPU data transfer
  Fully overlapped with CPU and GPU processing time

[Diagram: activity snapshot showing Kernels 0 to 3 overlapped with data transfers]
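
The benefit of overlapping transfers with computation can be estimated with a simple three-stage pipeline model: upload, kernel, download. The sketch below is a toy model, not a measurement. The stage times are hypothetical numbers picked for illustration, and the function names are not from any NVIDIA API. It assumes the data is split into equal chunks and that the upload, kernel and download of different chunks can all proceed at once, which requires two DMA engines.

```python
def serial_time(num_chunks, t_upload, t_kernel, t_download):
    """Every chunk is uploaded, processed and downloaded with no overlap."""
    return num_chunks * (t_upload + t_kernel + t_download)

def overlapped_time(num_chunks, t_upload, t_kernel, t_download):
    """Three-stage pipeline working on different chunks at the same time.

    The first chunk passes through all three stages; each later chunk
    adds only the time of the slowest stage.
    """
    bottleneck = max(t_upload, t_kernel, t_download)
    return (t_upload + t_kernel + t_download) + (num_chunks - 1) * bottleneck

# Hypothetical workload: 8 chunks, times in milliseconds.
chunks, up, kern, down = 8, 2, 5, 2
print("serial:    ", serial_time(chunks, up, kern, down), "ms")
print("overlapped:", overlapped_time(chunks, up, kern, down), "ms")
```

In this model the total time approaches the kernel time alone as the number of chunks grows, provided the kernel is the slowest stage.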

Enhanced Software Support

Full C++ Support
  Virtual functions
  Try/Catch hardware support

System call support
  Support for pipes, semaphores, printf, etc.

Unified 64-bit memory addressing

“I believe history will record Fermi as a significant milestone.”
Dave Patterson, Director, Parallel Computing Research Laboratory, U.C. Berkeley; Co-Author of Computer Architecture: A Quantitative Approach

“Fermi surpasses anything announced by NVIDIA's leading GPU competitor (AMD).”
Tom Halfhill, Senior Editor, Microprocessor Report

“Fermi is the world’s first complete GPU computing architecture.”
Peter Glaskowsky, Technology Analyst, The Envisioneering Group

“The convergence of new, fast GPUs optimized for computation as well as 3-D graphics acceleration and industry-standard software development tools marks the real beginning of the GPU computing era. Gentlemen, start your GPU computing engines.”
Nathan Brookwood, Principal Analyst & Founder, Insight 64

GPU Revolutionizing Computing

[Chart: GFlops against year (2006, 2008, 2010, 2012, 2015), showing T8 (128 core), T10 (240 core) and Fermi (512 core)]

A 2015 GPU *
  ~20× the performance of today’s GPU
  ~5,000 cores at ~3 GHz (50 mW each)
  ~20 TFLOPS
  ~1.2 TB/s of memory bandwidth

* This is a sketch of what a GPU in 2015 might look like; it does not reflect any actual product plans.
