7/29/2019 01_TeslaCUDAIntro
1/33
Tesla GPU Computing: A Revolution in High Performance Computing
Monash University GPU Computing Workshop
May 28, 2009
What is GPU Computing?
Computing with CPU + GPU: Heterogeneous Computing
[Figure label: 4 cores]
GPUs: Turning Point in Supercomputing
Desktop beats Cluster
Digital Tomography Reconstruction Time
256 AMD dual-core Opterons: 67.4 secs
4 Tesla C1060 GPUs: 59.9 secs
CalcUA: $5 Million
Source: University of Antwerp, Belgium
Not 2x or 3x: Speedups are 20x to 150x
146X: Medical Imaging, U of Utah
36X: Molecular Dynamics, U of Illinois, Urbana
18X: Video Transcoding, Elemental Tech
50X: Matlab Computing, AccelerEyes
149X: Financial simulation, Oxford
47X: Linear Algebra, Universidad Jaime
20X: 3D Ultrasound, Techniscan
130X: Quantum Chemistry, U of Illinois, Urbana
NVIDIA Tesla 10-Series GPU
Massively parallel, many core architecture
[Figure: chip block diagram with repeated blocks labeled L1]
Parallel Computing on All GPUs
100+ Million CUDA GPUs Deployed
GeForce: Entertainment
Quadro: Design & …
Tesla™: High-Performance Computing
Tesla GPU Computing Products
                        Tesla S1070 1U System   Tesla C1060 Computing Board   Tesla Personal Supercomputer (4 Te…)
GPUs                    4 Tesla GPUs            1 Tesla GPU                   4…
Single Precision Perf   4.14 Teraflops          933 Gigaflops                 3…
Double Precision Perf   346 Gigaflops           78 Gigaflops                  3…
Memory                  4 GB / GPU              4 GB
Tesla S1070: Green Supercomputing
20X Better Performance / Watt
Hess
Chevron
Petrobras
NCSA
CEA
Tokyo Tech
JFCOM
SAIC
Federal
Motorola
Kodak
Univer
Univer
Univer
Max Pl
Rice U
Univer
GusGu
Eotvas
Univer
Chines
Nation
A $5 Million Datacenter
                    CPU 1U Server                    CPU 1U Server + Tesla 1U System
Per node            2 Quad-core Xeon CPUs: 8 cores   8 CPU Cores + 4 GPUs = 968 cores
Single precision    0.17 Teraflop                    4.14 Teraflops
Double precision    0.08 Teraflop                    0.346 Teraflop
Price               $3,000                           $11,000
Power               700 W                            1500 W
Datacenter          1819 CPU servers                 455 CPU servers + 455 Tesla systems
Total single        310 Teraflops                    1961 Teraflops
Total double        155 Teraflops                    196 Teraflops
Total area          16K sq feet                      9K sq feet
Total power         1273 KW                          682 KW
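The datacenter totals above follow from multiplying the per-node figures by the node counts. Below is a minimal C sketch of that arithmetic. It assumes the $11,000 and 1500 W figures cover one CPU server plus one Tesla system, since that pairing reproduces the slide's totals. The struct, constant, and function names are illustrative only.

```c
#include <stdio.h>

/* Per-node figures copied from the slide. All names here are illustrative. */
typedef struct {
    int    nodes;          /* number of nodes in the datacenter          */
    double tflops_single;  /* single-precision teraflops per node        */
    double price_usd;      /* price per node in US dollars               */
    double watts;          /* power draw per node in watts               */
} NodeSpec;

double total_tflops(NodeSpec s)   { return s.nodes * s.tflops_single; }
double total_cost_usd(NodeSpec s) { return s.nodes * s.price_usd; }
double total_kw(NodeSpec s)       { return s.nodes * s.watts / 1000.0; }
double tflops_per_kw(NodeSpec s)  { return total_tflops(s) / total_kw(s); }

/* CPU-only datacenter: 1819 servers at 0.17 TFLOPS, $3,000, 700 W each. */
static const NodeSpec CPU_ONLY = { 1819, 0.17, 3000.0, 700.0 };

/* Assumption: one hybrid node = one CPU server (0.17 TFLOPS) plus one
   Tesla system (4.14 TFLOPS), costing $11,000 and drawing 1500 W in total. */
static const NodeSpec HYBRID = { 455, 0.17 + 4.14, 11000.0, 1500.0 };
```

On these inputs, `total_cost_usd` gives $5,457,000 for the CPU-only case and $5,005,000 for the hybrid case, and `total_kw` gives about 1273 kW and 682 kW, which agrees with the totals on the slide.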
Introducing the Tesla Personal Supercomputer
Supercomputing Performance
Massively parallel CUDA Architecture
960 cores. 4 TeraFlops
250x the performance of a desktop
Personal
One researcher, one supercomputer
Plugs into standard power strip
Accessible
Program in C for Windows, Linux
Available now worldwide under
More Information
Tesla main page
http://www.nvidia.com/tesla
Product Specs, Marketing Literature
CUDA Zone
http://www.nvidia.com/cuda
Applications, Papers, Videos
YouTube Videos
http://www.youtube.com/nvidiatesla
Hear CUDA develabout their experi
CUDA Parallel Programming Architecture and Programming Model
CUDA Parallel Computing Architecture
ATI's Compute Solution
Parallel computing architecture and programming model
Includes a C compiler plus support for OpenCL and DX11 Compute
Architected to natively support all computational interfaces (standard languages and APIs)
CUDA Facts
750+ Research Papers
50+ universities teaching CUDA
100 Million CUDA-Enabled GPUs
25K Active Developers
www.NVIDIA.com
Simple C Description For Parallelism
Standard C code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C code (C for CUDA):

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    // One thread per element: compute this thread's global index
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
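To see why the parallel version needs both the rounded-up block count and the `if (i < n)` guard, the launch can be emulated on the CPU. The sketch below is plain C, not CUDA: the two loops stand in for `blockIdx.x` and `threadIdx.x`, and the names `blocks_for` and `saxpy_emulated` are hypothetical helpers, not part of any CUDA API.

```c
#include <assert.h>

/* Smallest block count such that blocks * threads_per_block >= n.
   With threads_per_block = 256 this is the slide's (n + 255) / 256. */
int blocks_for(int n, int threads_per_block)
{
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

/* CPU emulation of the kernel launch. The outer loop plays the role of
   blockIdx.x, the inner loop the role of threadIdx.x. Returns the number
   of emulated threads, which can exceed n when n is not a multiple of
   threads_per_block. */
int saxpy_emulated(int n, float a, const float *x, float *y,
                   int threads_per_block)
{
    int nblocks = blocks_for(n, threads_per_block);
    for (int block = 0; block < nblocks; ++block) {
        for (int thread = 0; thread < threads_per_block; ++thread) {
            int i = block * threads_per_block + thread;  /* global index */
            if (i < n)            /* guard: the last block may be partial */
                y[i] = a * x[i] + y[i];
        }
    }
    return nblocks * threads_per_block;
}
```

For n = 1000 and 256 threads per block, `blocks_for` returns 4, so 1024 threads are emulated and the guard keeps the last 24 from writing past the end of the arrays.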
Compiling C for CUDA Applications
C for CUDA Application -> NVCC
NVCC -> C for CUDA Kernels -> CUDACC -> CUDA object files
NVCC -> Rest of C Application -> CPU Compiler -> CPU object files
CUDA object files + CPU object files -> Linker -> Combined CPU-GPU Executable
NVIDIA C for CUDA and OpenCL
C for CUDA: Entry point for developers who prefer …
OpenCL: Entry point for developers who want low-level API
C for CUDA and OpenCL -> Shared back-end compiler and optimization technology -> PTX -> GPU
Different Programming Styles
C for CUDA
C with parallel keywords
C runtime that abstracts driver API
Memory managed by C runtime (familiar malloc, free)
Generates PTX
Low-level driver API optionally available
OpenCL
Hardware API - similar to OpenGL and CUDA driver API
Memory managed by programmer
Generates PTX
Application Domains
Accelerating Time to Discovery
CPU Only      With GPU
4.6 Days      27 Minutes
2.7 Days      30 Minutes
8 Hours       13 Minutes
3…
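The speedup implied by each pair of run times is the CPU-only time divided by the GPU time, once both are in the same unit. Below is a minimal C sketch. It assumes the times on this slide pair up in the order they appear (4.6 days with 27 minutes, 2.7 days with 30 minutes, 8 hours with 13 minutes). The function names are illustrative.

```c
/* Unit conversions so every duration can be compared in minutes. */
double days_to_minutes(double days)   { return days * 24.0 * 60.0; }
double hours_to_minutes(double hours) { return hours * 60.0; }

/* Speedup = CPU-only run time / GPU run time, both in minutes. */
double speedup(double cpu_minutes, double gpu_minutes)
{
    return cpu_minutes / gpu_minutes;
}
```

Under that pairing, `speedup(days_to_minutes(4.6), 27.0)` is roughly 245, `speedup(days_to_minutes(2.7), 30.0)` is roughly 130, and `speedup(hours_to_minutes(8.0), 13.0)` is roughly 37.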
Computational Finance
[Chart: "Derivativ… Sc…", Time (secs), axis 0 to 35; bars: Intel Xeon (2.6 GHz), "1 T…"; value shown: 31.1 secs]
[Chart: "Mersenne Twiste… + Box-Mueller (M…", Million Samples per sec, axis 0 to 6000; bars: Intel Xeon Qua… (3.0 GHz), Tesla C1060; values shown: 164, 2116]
100x faster Rand… for Monte…
Financial Computing Software vendors
SciComp : Derivatives pricing modeling
Hanweck: Options pricing & risk analysis
Aqumin: 3D visualization of market data
Exegy: High-volume Tickers & Risk Analysis
QuantCatalyst: Pricing & Hedging Engine
Oneye: Algorithmic Trading
Arbitragis Trading: Trinomial Options Pricing
Ongoing work
LIBOR Monte Carlo market model
Callable Swaps and Continuous Time Finance
Molecular Dynamics
[Chart: "Ion Place…", Billion Evaluations/sec, axis 0 to 300; bar: Intel QX670… quad-core w… SSE; value shown: 0.89; 271x Faster; Source: Stone, P…]
[Chart: "Lennard-Jon…", "LAM…", N=24,300, Timesteps calculated/sec, axis 0 to 600; x-axis: "Numbe…"; Source: Anders…]
Available MD software
NAMD / VMD (alpha release)
HOOMD
ACE-MD
MD-GPU
Ongoing work
LAMMPS
CHARMM
GROMACS
AMBER
Quantum Chemistry
[Chart: Time (log-scale), axis 0.1 to 1000; "GAMESS on Intel Pen… vs CUDA code on Tes…"]
Molecule       GAMESS      CUDA
Caffeine       4.4 secs    0.2 secs
Cholesterol    1.1 mins    1.2 secs
Taxo…          4.7 min     4…
[Chart: "Coulomb Poten…", Time (secs), axis 0 to 700; "Gaussian 03 on Inte… vs CUDA code on 1…"]
System               Gaussian 03   CUDA
Taxol/LSDA/3-21G     2.8 mins      21.6 secs
Taxol/PW91/6-31G     8 mins        32.0 sec
VLS…                 4…
Source: Ufimtsev, …
Source: Yas…
Available MD software
NAMD / VMD (alpha release)
HOOMD
ACE-MD
MD-GPU
Ongoing work
LAMMPS
CHARMM
Q-Chem
Gaussian
Computational Fluid Dynamics (CFD)
[Chart: "Incompress…", Gflops, axis 0 to 60; grid sizes: 128x32x128, 256x3x256; bars: AMD Opteron 2.1…, Tesla C870, 2 Tesla C870s, 4 Tesla C870s; values shown: 0.9, 0.6, 24]
[Chart: "Lattice Boltz… for 128x12…", Million Lattice Updates per Sec (MLUPs), axis 0 to 700; bars: Intel Xeon (3.4 GHz), Intel Itanium (1.4 G…; values shown: 4.8, 7.6]
Source: Thiba…
Source: To…
Ongoing work
Navier-Stokes
Lattice Boltzmann
3D Euler Solver
Weather and ocean modeling
Electromagnetics / Electrodynamics
[Chart: "Cell Phone Mo… Simulation si…", Speed (Mcells/s), axis 0 to 600; bar: Intel Xeon (2.6 GHz); value shown: 9.9 Mcells/s; "FDTD Accelerat…"; Source: Ac…]
FDTD Solvers
Acceleware
EM Photonics
CUDA Tutorial
Ongoing work
Maxwell equation solver
Ring Oscillator (FDTD)
Particle beam dynamics simulator
Weather, Atmospheric, & Ocean Modeling
[Chart: "WSM5 Mi…", Mflops/s, axis 0 to 70000; bar: Intel Xeon (3.0 GHz); value shown: 1,315 Mflops/s]
[Chart: "5 day…", "T3000km", Time (mins), axis 0 to 350; bar: Intel Xeon (2…]
Source: Mic…
CUDA-accelerated WRF available
Other kernels in WRF being ported
Ongoing work
Tsunami modeling
Ocean modeling
Several CFD codes
Source: Mat
Libraries
FFT Performance: CPU vs GPU (8-Series)
1D Fast Fourier Transform on CUDA
[Chart: GFLOPS (axis 0 to 90) vs. Transform Size (Power of 2); series: CUFFT 2.x, CUFFT 1.1, INTEL MKL 10.0, FFTW 3.x]
NVIDIA Tesla C870 GPU (8-series GPU)
Quad-Core Intel Xeon CPU 5400 Series 3.0GHz
In-place, complex, single precision
Intel
calc
sam
Rea
~10
Source for Intel data : http://www.intel.com/cd/software/products/asmo-na/eng/266852.htm
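GFLOPS figures for FFTs are usually derived from a nominal operation count rather than from counted instructions. Below is a minimal C sketch, assuming the common 5 N log2 N convention for a complex transform of size N. The FFTW benchmark suite uses this convention; the slide does not state which count it uses. The function names are hypothetical and any timing passed in is the caller's own measurement.

```c
/* log2 of a power-of-two transform size, computed with integer shifts. */
int log2_pow2(unsigned long n)
{
    int k = 0;
    while (n > 1) {
        n >>= 1;
        ++k;
    }
    return k;
}

/* Nominal GFLOPS for one complex FFT of size n that took `seconds`,
   using the assumed 5 * n * log2(n) operation count. */
double fft_gflops(unsigned long n, double seconds)
{
    double flops = 5.0 * (double)n * (double)log2_pow2(n);
    return flops / seconds / 1e9;
}
```

Because the count is nominal, two libraries with different internal algorithms are scored on the same scale, so the chart compares time to solution rather than instructions executed.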
BLAS: CPU vs GPU
cuBLAS 2.2: Tesla C1060
MKL 10.0.3.020: Quad-core 3.0GHz Core2 Extreme (4 threads)
[Chart: Single Precision BLAS: SGEMM; GFLOPS (axis 0 to 400) vs. Matrix Size M (MxM square); series: cuBLAS 2.2, MKL 10.0.3]
[Chart: Double Precision BLA…; GFLOPS (axis 0 to 80) vs. Matrix Size M (Mx…]
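The GFLOPS axis for a square matrix multiply follows from the standard operation count: multiplying two MxM matrices takes M^3 multiplications and M^3 additions, or 2M^3 floating-point operations. Below is a minimal C sketch with hypothetical function names. Any timing passed in is the caller's own measurement, not a value from the slide.

```c
/* Floating-point operations in an M x M by M x M matrix product:
   M^3 multiplies plus M^3 adds. */
double gemm_flops(int m)
{
    double d = (double)m;
    return 2.0 * d * d * d;
}

/* Achieved GFLOPS for one such product that took `seconds`. */
double gemm_gflops(int m, double seconds)
{
    return gemm_flops(m) / seconds / 1e9;
}
```

As an illustration with a made-up timing, a 1000x1000 product that finished in 10 ms would score about 200 GFLOPS.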
GPU + CPU DGEMM Performance
[Chart: GFLOPs (axis 0 to 120) vs. Size (128 to 5696); series: GPU + CPU, GPU only, CPU only]
Xeon Quad-core 2.8 GHz, MKL 10
Tesla C1060 GPU (1.296 GHz)
Results: Sparse Matrix-Vector Multiplication (SpMV) on C
CPU Results from "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Williams et al.
Questions?
http://www.nvidia.com/CUDA