View
105
Download
1
Category
Tags:
Preview:
DESCRIPTION
High-Performance Computing with NVIDIA Tesla GPUs. Timothy Lanfear, NVIDIA. Why GPU Computing?. Science is Desperate for Throughput. Gigaflops. 1 Exaflop. 1,000,000,000. 1 Petaflop. Bacteria 100s of Chromatophores. 1,000,000. Chromatophore 50M atoms. 1,000. Ribosome 2.7M atoms. - PowerPoint PPT Presentation
Citation preview
HIGH-PERFORMANCE COMPUTING WITH NVIDIA
TESLA GPUSTimothy Lanfear, NVIDIA
© NVIDIA Corporation 2009
WHY GPU COMPUTING?
© NVIDIA Corporation 2009
Science is Desperate for Throughput
1982 1997 2003 2006 2010 2012
1,000,000,000
1,000,000
1,000
1
Gigaflops
Estrogen Receptor36K atoms
F1-ATPase327K atoms
Ribosome2.7M atoms
Chromatophore50M atoms
BPTI3K atoms
Bacteria100s of
Chromatophores
1 Exaflop
1 Petaflop
Ran for 8 months to simulate 2 nanoseconds
© NVIDIA Corporation 2009
Power Crisis in Supercomputing
1982 1996 2008 2020
Exaflop
Petaflop
Teraflop
Gigaflop
Household Power Equivalent
City
Town
Neighborhood
Block
7,000,000 Watts
25,000,000 Watts
850,000 Watts
60,000 Watts
JaguarLos Alamos
© NVIDIA Corporation 2009
“Oak Ridge National Lab (ORNL) has already announced it will be using Fermi technology in an upcoming super that is ‘expected to be 10-times more powerful than today’s fastest supercomputer.’
Since ORNL’s Jaguar supercomputer, for all intents and purposes, holds that title, and is in the process of being upgraded to 2.3 Petaflops …
… we can surmise that the upcoming Fermi-equipped super is going to be in the 20 Petaflops range.”
September 30 2009
© NVIDIA Corporation 2009
What is GPU Computing?
Computing with CPU + GPUHeterogeneous Computing
x86 GPUPCIe bus
© NVIDIA Corporation 2009
Low Latency or High Throughput?
CPUOptimised for low-latency access to cached data setsControl logic for out-of-order and speculative execution
GPUOptimised for data-parallel, throughput computationArchitecture tolerant of memory latencyMore transistors dedicated to computation
Cache
ALUControl
ALU
ALU
ALU
DRAM
DRAM
© NVIDIA Corporation 2009
Why Didn’t GPU Computing Take Off Sooner?
GPU ArchitectureGaming oriented, process pixel for displaySingle threaded operationsNo shared memory
Development ToolsGraphics oriented (OpenGL, GLSL)University research (Brook)Assembly language
DeploymentGaming solutions with limited lifetimeExpensive OpenGL professional graphics boardsNo HPC compatible products
© NVIDIA Corporation 2009
NVIDIA Invested in GPU Computing in 2004
Strategic move for the companyExpand GPU architecture beyond pixel processingFuture platforms will be hybrid, multi/many cores based
Hired key industry expertsx86 architecturex86 compilerHPC hardware specialist
Create a GPU based Compute Ecosystem by 2008
© NVIDIA Corporation 2009
NVIDIA GPU Computing Ecosystem
NVIDIA Hardware Solutions
CUDA SDK & Tools
GPU Architecture
Deployment
Hardware
Architecture
Customer
Requirements
Customer
Application
TPP / OEMCUDA Development
Specialist
ISV
CUDA Training
CompanyHardware Architect
VAR
© NVIDIA Corporation 2009
TeslaTM
High-Performance ComputingQuadro®
Design & CreationGeForce®
Entertainment
NVIDIA GPU Product Families
© NVIDIA Corporation 2009
Many-Core High Performance Computing
NVIDIA’s 10-series GPU has 240 cores
Each core has aFloating point / integer unitLogic unit Move, compare unitBranch unit
Cores managed by thread managerThread manager can spawn and manage 30,000+ threads Zero overhead thread switching
1.4 billion transistors
1 Teraflop of processing power
240 processing cores
NVIDIA 10-Series GPU
NVIDIA’s 2nd Generation CUDA Processor
© NVIDIA Corporation 2009
Tesla S1070 1U System
Tesla C1060 Computing Board
Tesla Personal Supercomputer
GPUs 2 Tesla GPUs 4 Tesla GPUs 1 Tesla GPU 4 Tesla GPUsSingle Precision
Performance1.87 Teraflops 4.14 Teraflops 933 Gigaflops 3.7 Teraflops
Double PrecisionPerformance
156 Gigaflops 346 Gigaflops 78 Gigaflops 312 Gigaflops
Memory 8 GB (4 GB / GPU) 16 GB (4 GB / GPU) 4 GB 16 GB (4 GB / GPU)
SuperMicro 1UGPU SuperServer
Tesla GPU Computing Products
© NVIDIA Corporation 2009
Tesla C1060 Computing Processor
Processor 1 × Tesla T10Number of cores 240
Core Clock 1.296 GHz
Floating Point Performance
933 Gflops Single Precision
78 Gflops Double Precision
On-board memory 4.0 GB Memory bandwidth 102 GB/sec peak
Memory I/O 512-bit, 800MHz GDDR3
Form factor Full ATX: 4.736″ x 10.5″Dual slot wide
System I/O PCIe ×16 Gen2
Typical power 160 W
© NVIDIA Corporation 2009
Processor 1 × Tesla T10Number of cores 240
Core Clock 1.296 GHz
Floating Point Performance
933 Gflops Single Precision
78 Gflops Double Precision
On-board memory 4.0 GB Memory bandwidth 102 GB/sec peak
Memory I/O 512-bit, 800MHz GDDR3
Form factor Full ATX: 4.736″ x 10.5″Dual slot wide
System I/O PCIe ×16 Gen2
Typical power 160 W
Tesla M1060 Embedded Module
OEM-only productAvailable as integrated product in OEM systems
© NVIDIA Corporation 2009
Tesla Personal Supercomputer
Supercomputing PerformanceMassively parallel CUDA Architecture960 cores. 4 Teraflops250× the performance of a desktop
Personal One researcher, one supercomputerPlugs into standard power strip
AccessibleProgram in C for Windows, LinuxAvailable now worldwide under $10,000
© NVIDIA Corporation 2009
Tesla S1070 1U SystemProcessors 4 × Tesla T10
Number of cores 960
Core Clock 1.44 GHz
Performance 4 Teraflops
Total system memory 16.0 GB (4.0 GB per T10)
Memory bandwidth408 GB/sec peak
(102 GB/sec per T10)
Memory I/O2048-bit, 800MHz GDDR3
(512-bit per T10)
Form factor 1U (EIA 19″ rack)
System I/O 2 PCIe ×16 Gen2
Typical power 700 W
© NVIDIA Corporation 2009
SuperMicro GPU 1U SuperServer
Two M1060 GPUs in a 1U
Dual Nehalem-EP Xeon CPUs
Up to 96 GB DDR3 ECC
Onboard Infiniband (QDR)
3× hot-swap 3.5″ SATA HDD
1200 W power supply
M1060 GPUs
© NVIDIA Corporation 2009
Tesla Cluster Configurations
Modular 2U compute nodeTesla S1070 + Host Server
4 Teraflops
Integrated 1U compute nodeGPU SuperServer
2 Teraflops
© NVIDIA Corporation 2009
CUDA FortranCUDA C Java and PythonOpenCL™ DirectCompute
GPU Computing Applications
CUDA Parallel Computing Architecture
NVIDIA GPU with the CUDA Parallel Computing Architecture
OpenCL is trademark of Apple Inc. used under license to the Khronos Group Inc.
© NVIDIA Corporation 2009
NVIDIA CUDA C and OpenCL
Shared back-end compiler and optimization technology
OpenCL
CUDA C
PTX
GPU
Entry point for developers who prefer high-level C
Entry point for developers who want low-level API
© NVIDIA Corporation 2009
CUDA LibrariescuFFT cuBLAS cuDPP
CUDA Compiler
C Fortran
CUDA Tools
Debugger Profiler
CPU Hardware
PCI-E Switch1U
Application Software(written in C)
4 cores 240 cores
© NVIDIA Corporation 2009
NVIDIA Nexus
Register for the Beta here at GTC!
The first development environment for massively parallel applications.
Beta available October 2009Releasing in Q1 2010
Hardware GPU Source Debugging
Platform-wide Analysis
Complete Visual Studio integration
http://developer.nvidia.com/object/nexus.html
Graphics Inspector
Platform Trace
Parallel Source Debugging
© NVIDIA Corporation 2009
CUDA Zone: www.nvidia.com/CUDA CUDA Toolkit
CompilerLibraries
CUDA SDKCode samples
CUDA Profiler
Forums
Resources for CUDA developers
© NVIDIA Corporation 2009
Wide Developer Acceptance and Success
146X
Interactive visualization of
volumetric white matter connectivity
36X
Ion placement for molecular dynamics simulation
19X
Transcoding HD video stream to
H.264
17X
Simulation in Matlab using .mex file CUDA function
100X
Astrophysics N-body simulation
149X
Financial simulation of
LIBOR model with swaptions
47X
GLAME@lab: An M-script API for linear Algebra
operations on GPU
20X
Ultrasound medical imaging
for cancer diagnostics
24X
Highly optimized object oriented
molecular dynamics
30X
Cmatch exact string matching to
find similar proteins and gene
sequences
© NVIDIA Corporation 2009
CUDA Co-Processing Ecosystem
Applications LibrariesFFT
BLASLAPACK
Image processingVideo processingSignal processing
Vision
Consultants OEMs
LanguagesC, C++DirectXFortranJava
OpenCLPython
CompilersPGI FortranCAPs HMPP
MCUDAMPI
NOAA Fortran2COpenMP
UIUCMIT
HarvardBerkeley
CambridgeOxford
…
IIT DelhiTsinghua
DortmundtETH Zurich
MoscowNTU…
Over 200 Universities Teaching CUDA
ANEO
GPU Tech
Oil & Gas Finance
Medical Biophysics
Numerics
Imaging
CFD
DSP EDA
© NVIDIA Corporation 2009
What We Did in the Past Three Years
2006G80, first GPU with built-in compute features, 128 core multi-threaded, scalable architectureCUDA SDK Beta
2007Tesla HPC product lineCUDA SDK 1.0, 1.1
2008GT200, second GPU generation, 240 core, 64-bitTesla HPC second generationCUDA SDK 2.0
2009 …
© NVIDIA Corporation 2009
NEXT-GENERATION GPU ARCHITECTURE — ‘FERMI’
© NVIDIA Corporation 2009
3 billion transistors
Over 2× the cores (512 total)
8× the peak DP performance
ECC
L1 and L2 caches
~2× memory bandwidth (GDDR5)
Up to 1 Terabyte of GPU memory
Concurrent kernels
Hardware support for C++
DR
AM
I/F
HO
ST I/
FG
iga
Thre
adD
RA
M I/
F DR
AM
I/FD
RA
M I/F
DR
AM
I/FD
RA
M I/F
L2
Introducing the ‘Fermi’ ArchitectureThe Soul of a Supercomputer in the body of a GPU
© NVIDIA Corporation 2009
DataParallel
InstructionParallel
Many Decisions Large Data Sets
CPU
GPU
Design Goal of Fermi
Expand performance sweet spot of the GPU
Bring more users, more applications to the GPU
© NVIDIA Corporation 2009
Streaming Multiprocessor ArchitectureRegister File
Scheduler
Dispatch
Scheduler
Dispatch
Load/Store Units x 16Special Func Units x 4
Interconnect Network
64K ConfigurableCache/Shared Mem
Uniform Cache
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Instruction Cache
32 CUDA cores per SM (512 total)
8× peak double precision floating point performance
50% of peak single precision
Dual Thread Scheduler
64 KB of RAM for shared memory and L1 cache (configurable)
© NVIDIA Corporation 2009
CUDA Core ArchitectureRegister File
Scheduler
Dispatch
Scheduler
Dispatch
Load/Store Units x 16Special Func Units x 4
Interconnect Network
64K ConfigurableCache/Shared Mem
Uniform Cache
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Instruction Cache
CUDA CoreDispatch Port
Operand Collector
Result Queue
FP Unit INT Unit
New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs
Fused multiply-add (FMA) instruction for both single and double precision
Newly designed integer ALU optimized for 64-bit and extended precision operations
© NVIDIA Corporation 2009
Cached Memory Hierarchy
First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory
L1 Cache per SM (32 cores)Improves bandwidth and reduces latency
Unified L2 Cache (768 KB)Fast, coherent data sharing across all cores in the GPU
DR
AM
I/F
Gig
a Th
read
HO
ST I/
FD
RA
M I/
F
DR
AM
I/FD
RA
M I/F
DR
AM
I/FD
RA
M I/F
L2
Parallel DataCache™
Memory Hierarchy
© NVIDIA Corporation 2009
Larger, Faster Memory Interface
GDDR5 memory interface2× speed of GDDR3
Up to 1 Terabyte of memory attached to GPU
Operate on large data sets
DR
AM
I/F
Gig
a Th
read
HO
ST I/
FD
RA
M I/
F
DR
AM
I/FD
RA
M I/F
DR
AM
I/FD
RA
M I/F
L2
© NVIDIA Corporation 2009
Error Correcting Code
ECC protection forDRAM
ECC supported for GDDR5 memory
All major internal memories are ECC protectedRegister file, L1 cache, L2 cache
© NVIDIA Corporation 2009
GigaThreadTM Hardware Thread Scheduler
Hierarchically manages thousands of simultaneously active threads
10× faster application context switching
Concurrent kernel execution
HTS
© NVIDIA Corporation 2009
GigaThread Hardware Thread Scheduler Concurrent Kernel Execution + Faster Context Switch
Serial Kernel Execution Parallel Kernel Execution
Tim
e
Kernel 1 Kernel 1 Kernel 2
Kernel 2 Kernel 3
Kernel 3
Ker4
nel Kernel 5
Kernel 5
Kernel 4
Kernel 2
Kernel 2
© NVIDIA Corporation 2009
GigaThread Streaming Data Transfer Engine
Dual DMA enginesSimultaneous CPUGPU and GPUCPU data transferFully overlapped with CPU and GPU processing time
Activity Snapshot:
SDT
Kernel 0Kernel 1
Kernel 2Kernel 3
CPU
CPU
CPU
CPU
SDT0
SDT0
SDT0
SDT0
GPU
GPU
GPU
GPU
SDT1
SDT1
SDT1
SDT1
© NVIDIA Corporation 2009
Enhanced Software Support
Full C++ SupportVirtual functionsTry/Catch hardware support
System call supportSupport for pipes, semaphores, printf, etc
Unified 64-bit memory addressing
© NVIDIA Corporation 2009
I believe history will record Fermi as a significant milestone.“ ”
Dave PattersonDirector Parallel Computing Research Laboratory, U.C. BerkeleyCo-Author of Computer Architecture: A Quantitative Approach
Fermi surpasses anything announced by NVIDIA's leading GPU competitor (AMD).“ ”
Tom HalfhillSenior Editor
Microprocessor Report
© NVIDIA Corporation 2009
Fermi is the world’s first complete GPU computing architecture.“ ”
Peter GlaskowskyTechnology AnalystThe Envisioneering Group
The convergence of new, fast GPUs optimized for computation as well as 3-D graphics acceleration and industry-standard software
development tools marks the real beginning of the GPU computing era. Gentlemen, start your GPU computing engines.
Nathan BrookwoodPrinciple Analyst & Founder
Insight 64
”“
© NVIDIA Corporation 2009
2006 2010 2012 20152008
GPU
T8128 core
T10240 core
A 2015 GPU *~20× the performance of today’s GPU~5,000 cores at ~3 GHz (50 mW each)~20 TFLOPS~1.2 TB/s of memory bandwidth
* This is a sketch of a what a GPU in 2015 might look like; it does not reflect any actual product plans.
GPU Revolutionizing ComputingGFlops
Fermi512 core
© NVIDIA Corporation 2009
Recommended