01_TeslaCUDAIntro

7/29/2019 01_TeslaCUDAIntro

1/33

Tesla GPU ComputingA Revolut ion in High Performance Compu t ing

Monash University GPU Computing Workshop

May 28, 2009


2/33

4 cores

What is GPU Computing?

Computing with CPU + GPUHeterogeneous Computin g


3/33

GPUs: Turning Point in Supercomputing

Te

SuCalcUA$5 Mil l ion

Source: University of Antwerp, Belgium

67.4 secs

59.9 secs

55 60 65 70

256 AMD dual-core Opterons

4 TeslaC1060 GPUs

Digital TomographyReconstruction Time

Desktop beats Cluster


4/33

146X

Medical ImagingU of Utah

36X

Molecular DynamicsU of Illinois, Urbana

18X

Video TranscodingElemental Tech

50X

Matlab ComputingAccelerEyes

149X

Financial simulationOxford

47X

Linear AlgebraUniversidad Jaime

20X

3D UltrasoundTechniscan

130X

Quantum ChemistryU of Illinois, Urbana

G

Not 2x or 3x : Speedups are 20x to 150x


5/33

L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1

L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1

NVIDIA Tesla 10-Series GPU

Massively parallel, many core a


6/33


7/33

TeslaTMHigh-Performance Computing

QuaDesign &

GeForce

Entertainment

Parallel Computing on All GPUs100+ Mil l ion CUDA GPUs Deployed


8/33

Tesla GPU Computing Products

Tesla S1070 1U SystemTesla C1060

Computing Board

TeslSupe

(4 Te

GPUs 4 Tesla GPUs 1 Tesla GPU 4

Single Precision Perf 4.14 Teraflops 933 Gigaflops 3

Double Precision Perf 346 Gigaflops 78 Gigaflops 3

Memory 4 GB / GPU 4 GB


9/33

Tesla S1070: Green Supercomputing

20X BetterPerformance / Watt

Hess

Chevron

Petrobras

NCSA

CEA

Tokyo Tech

JFCOM

SAIC

Federal

Motorola

Kodak

Univer

Univer

Univer

Max Pl

Rice U

Univer

GusGu

Eotvas

Univer

Chines

Nation


10/33

A $5 Million Datacenter

CPU 1U Server CPU 1

Tesla 1

2 Quad-core XeonCPUs: 8 cores

0.17 Teraflop (single)0.08 Teraflop (double)

$ 3,000

700 W

1819 CPU servers

310 Teraflops (single)

155 Teraflops (double)

Total area 16K sq feet

Total 1273 KW

8 CPU Cores +4 GPUs = 968 cores

4.14 Teraflops (single)0.346 Teraflop (double)

$ 11,000

1500 W

455 CPU servers

455 Tesla systems1961 Teraflops (single)

196 Teraflops (double)

Total area 9K sq feet

Total 682 KW


11/33

Introducing the Tesla Personal Supercomp

Supercomputing PerformanMassively parallel CUDA Archite

960 cores. 4 TeraFlops250x the performance of a desk

PersonalOne researcher, one supercomp

Plugs into standard power strip

AccessibleProgram in C for Windows, Linu

Available now worldwide under


12/33

More Information

Tesla main page

http://www.nvidia.com/tesla

Product Specs, MarketingLiterature

CUDA Zone

http://www.nvidia.com/cuda

Applications, Papers, Videos

YouTube Videoshttp://www.youtube.co

Hear CUDA develabout their experi
http://www.nvidia.com/teslahttp://www.nvidia.com/cudahttp://www.youtube.com/nvidiateslahttp://www.youtube.com/nvidiateslahttp://www.nvidia.com/cudahttp://www.nvidia.com/tesla


13/33

CUDA Parallel Programming Architectureand Programming Model


14/33

CUDA Parallel Computing Architecture

ATIs Compute Solution

Parallel computing architectureand programming model

Includes a C compiler plussupport for OpenCL and

DX11 Compute

Architected to natively supportall computational interfaces

(standard languages and APIs)


15/33

CUDA Facts

750+ Research Papers

50+ universities teaching CUDA

100 Million CUDA-Enabled GPUs

25K Active Developers

www.NVIDIA.com


16/33

Simple C Description For Parallelism

void saxpy_serial(int n, float a, float *x, float *y){ for (int i = 0; i < n; ++i)y[i] = a*x[i] + y[i];}// Invoke serial SAXPY kernelsaxpy_serial(n, 2.0, x, y);__global__ void saxpy_parallel(int n, float a, float *x, float { int i = blockIdx.x*blockDim.x + threadIdx.x;if (i < n) y[i] = a*x[i] + y[i];}// Invoke parallel SAXPY kernel with 256 threads/blockint nblocks = (n + 255) / 256;saxpy_parallel(n, 2.0, x, y);

Standa

Para


17/33

Compiling C for CUDA Applications

CUDACC CPU Compiler

C for CUDAKernels

CUDA objectfiles

Rest of CApplication

CPU objectfiles

CPU-GPU

Executable

NVCC

C for CUDAApplication

Linker

Combined CPU-G


18/33

Shared back-end compiler

and optimization technology

NVIDIA C for CUDA and OpenCL

OpenCL

C for CUDA

PTX

GPU

Entry poinwho prefe

Entry point for developerswho want low-level API


19/33

Different Programming Styles

C for CUDA

C with parallel keywordsC runtime that abstracts driver API

Memory managed by C runtime (familiar malloc, free)

Generates PTX

Low-level driver API optionally available

OpenCL

Hardware API - similar to OpenGL and CUDA driver A

Memory managed by programmer

Generates PTX


20/33

Application Domains

A l ti Ti t Di


21/33

Accelerating Time to Discovery

4.6 Days

27 Minutes

2.7 Days

30 Minutes

8 Hours

13 Minutes

3

CPU Only With GPU


22/33

Computational Finance

31.1 secs

0

5

1015

20

25

30

35

Intel Xeon(2.6 GHz)

1 T

Time

(secs)Derivativ

Sc

164

2116

0

1000

2000

3000

4000

5000

6000

Mersenne Twiste+ Box-Mueller (M

Million

Samples

per sec

100x faster Randfor Monte

Intel Xeon Qua(3.0 GHz)Tesla C1060

Source:

Financial Computing Software vendors

SciComp : Derivatives pricing modeling

Hanweck: Options pricing & risk analysisAqumin: 3D visualization of market data

Exegy: High-volume Tickers & Risk Analysis

QuantCatalyst: Pricing & Hedging Engine

Oneye: Algorithmic Trading

Arbitragis Trading: Trinomial Options Pricing

Ongoing work

LIBOR Monte Carlo market model

Callable Swaps and Continuous Time Finance

Source


23/33

0.890

50100

150

200

250

300

Intel QX670quad-core w

SSE

Billion

Evaluations/sec Ion Place

271xFaster

0

100

200

300

400

500

600

N=24,300 NTimestepscalculated/sec

Numbe

Lennard-Jon LAM

Source: Stone, P

Molecular Dynamics

Available MD software

NAMD / VMD (alpha release)HOOMD

ACE-MD

MD-GPU

Ongoing work

LAMMPSCHARMM

GROMACS

AMBER

Source: Anders


24/33

4.4 secs

1.1 mins

4.7 min

0.2 secs

1.2 secs

4

0.1

1

10

100

1000

Caffeine Cholesterol Taxo

Time(Log-scale)

GAMESS on Intel Penvs CUDA code on Tes

Quantum Chemistry

2.8 mins

8 mins

4

21.6 secs 32.0 sec

0

100

200

300

400

500600

700

Taxol/ LSDA/3-21G

Taxol/PW91/ 6-

31G

VLS

Time

(secs)

Coulomb PotenGaussian 03 on Inte

vs CUDA co de on 1

Source: Ufimtsev,

Source: Yas

Available MD software

NAMD / VMD (alpha release)HOOMD

ACE-MD

MD-GPU

Ongoing work

LAMMPSCHARMM

Q-Chem

Gaussian


25/33

Computational Fluid Dynamics (CFD)

0.9 0.6

24

0

10

20

30

40

50

60

128x32x128

256x3x256

Gflops Incompress

AMD Opteron 2.1 Tesla C8702 Tesla C870s4 Tesla C870s

4.8 7.60

100

200

300

400

500600

700

Intel Xeon(3.4 GHz)

InteItanium(1.4 G

MillionLattice

Updates

per

Sec

(MLUPs)

Lattice Boltz

for 128x12

Source: Thiba

Source: To

Ongoing work

Navier-StokesLattice Boltzman

3D Euler Solver

Weather and ocean modeling


26/33

Electromagnetics / Electrodynamics

9.9

Mcells/s0

100

200

300

400

500

600

Intel Xeon (2.6 GHz)

Speed

Mcells/s

Cell Phone MoSimulation si

FDTD AcceleratSource: Ac

FDTD Solvers

Acceleware

EM Photonics

CUDA Tutorial

Ongoing work

Maxwell equation solver

Ring Oscillator (FDTD)

Particle beam dynamics simulator


27/33

Weather, Atmospheric, & Ocean Modeling

1,315Mflops/s

0

10000

2000030000

40000

50000

60000

70000

Intel Xeon(3.0 GHz)

Mflops/s

WSM5 Mi

5 day

0

50

100

150

200

250

300

350

Intel Xeon (2

Time

(mins)

T3000km

Source: Mic

CUDA-accelerated WRF available

Other kernels in WRF being ported

Ongoing work

Tsunami modeling

Ocean modeling

Several CFD codes

Source: Mat


28/33

Libraries

FFT P f CPU GPU (8 S i )


29/33

FFT Performance: CPU vs GPU (8-Series)

0

10

20

30

40

50

60

70

80

90

GFLOPS

Transform Size (Power of 2)

1D Fast Fourier TransformOn CUDA

CUFFT 2.x

CUFFT 1.1

INTEL MKL 10.0

FFTW 3.x

NVIDIA Tesla C870 GPU (8-series GPU)

Quad-Core Intel Xeon CPU 5400 Series 3.0GHz,

In-place, complex, single precision

Intel

calc

sam

Rea

~10

Source for Intel data : http://www.intel.com/cd/software/products/asmo-na/eng/266852.htm

BLAS CPU GPU


30/33

BLAS: CPU vs GPU

cuBLAS 2.2: Tesla C1060MKL 10.0.3.020: Quad-core 3.0GHz Core2 Extreme (4 thread

0

50

100

150

200

250

300

350

400

GFLOPS

Matrix Size M (MxM square)

Single Precision BLAS: SGEMM

cuBLAS 2.2

MKL 10.0.3

0

10

20

30

40

50

60

70

80

GFLOPS

Matrix Size M (Mx

Double Precision BLA

GPU + CPU DGEMM P f


31/33

GPU + CPU DGEMM Performance

0

20

40

60

80

100

120

128

320

512

704

896

1088

1280

1472

1664

1856

2048

2240

2432

2624

2816

3008

3200

3392

3584

3776

3968

4160

4352

4544

4736

4928

5120

5312

5504

5696

GFLOPs

Size

Xeon Quad-core 2.8 GHz, MKL 10

Tesla C1060 GPU (1.296 GHz)

GPU + CPU

GPU + CP

GPU on

CPU onl

Results: Sparse Matrix Vector Multiplication (SpMV) on C


32/33

Results: Sparse Matrix-Vector Multiplication (SpMV) on C

CPU Results from Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Williams et


33/33

Questions?

http://www.nvidia.com/CUDA

[email protected]

Documents

01_TeslaCUDAIntro