IAP09 CUDA@MIT/6.963 David Luebke NVIDIA Research

IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

DESCRIPTION

See http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009


Page 1: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

IAP09 CUDA@MIT/6.963

David Luebke NVIDIA Research

Page 2: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

The “New” Moore’s Law

 Computers no longer get faster, just wider

 You must re-think your algorithms to be parallel!

 Data-parallel computing is most scalable solution

Page 3: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Enter the GPU

 Massive economies of scale

 Massively parallel

Page 4: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Enter CUDA

 Scalable parallel programming model

 Minimal extensions to familiar C/C++ environment

 Heterogeneous serial-parallel computing

Page 5: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Sound Bite

GPUs + CUDA =

The Democratization of Parallel Computing

Massively parallel computing has become a commodity technology

Page 6: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

MOTIVATION

Page 7: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

MOTIVATION

Measured application speedups vs. CPU: 146X, 36X, 19X, 17X, 100X, 149X, 47X, 20X, 24X, 30X

Page 8: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

CUDA: ‘C’ FOR PARALLELISM

Standard C Code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C Code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
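For orientation, a minimal host-side driver for the parallel version might look like the following sketch (not part of the original slide; h_x and h_y are assumed host arrays already holding n floats):

float *d_x, *d_y;
cudaMalloc((void**)&d_x, n * sizeof(float));
cudaMalloc((void**)&d_y, n * sizeof(float));
cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, d_x, d_y);   // same launch as above

cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);  // copy result back
cudaFree(d_x);
cudaFree(d_y);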

Page 9: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Hierarchy of concurrent threads

 Parallel kernels composed of many threads
   all threads execute the same sequential program

 Threads are grouped into thread blocks
   threads in the same block can cooperate

 Threads/blocks have unique IDs

[Diagram: kernel foo() launches many blocks; each block b contains threads t0 t1 … tB]

Page 10: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Hierarchical organization

[Diagram: memory hierarchy]
  Thread: per-thread local memory
  Block: per-block shared memory
  Kernel 0, Kernel 1, …: per-device global memory
  Local barrier within a block; global barrier between dependent kernels

Page 11: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Heterogeneous Programming

 CUDA = serial program with parallel kernels, all in C
   Serial C code executes in a CPU thread
   Parallel kernel C code executes in thread blocks across multiple processing elements

[Diagram: execution alternates between serial and parallel phases]
  Serial Code
  Parallel Kernel: foo<<< nBlk, nTid >>>(args);
  Serial Code
  Parallel Kernel: bar<<< nBlk, nTid >>>(args);

Page 12: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Thread = virtualized scalar processor

  Independent thread of execution
   has its own PC, variables (registers), processor state, etc.
   no implication about how threads are scheduled

 CUDA threads might be physical threads
   as on NVIDIA GPUs

 CUDA threads might be virtual threads
   might pick 1 block = 1 physical thread on a multicore CPU

Page 13: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Block = virtualized multiprocessor

 Provides programmer flexibility
   freely choose processors to fit data
   freely customize for each kernel launch

  Thread block = a (data) parallel task
   all blocks in kernel have the same entry point
   but may execute any code they want

  Thread blocks of kernel must be independent tasks
   program valid for any interleaving of block executions

Page 14: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Blocks must be independent

 Any possible interleaving of blocks should be valid
   presumed to run to completion without pre-emption
   can run in any order
   can run concurrently OR sequentially

 Blocks may coordinate but not synchronize
   shared queue pointer: OK
   shared lock: BAD … can easily deadlock

  Independence requirement gives scalability

Page 15: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Scalable Execution Model

Kernel launched by host

[Diagram: the kernel's blocks are distributed across the GPU's multiprocessors;
each multiprocessor contains a multithreaded instruction unit (MT IU), scalar
processors (SP), and shared memory; all multiprocessors access device memory]

Blocks Run on Multiprocessors

Page 16: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Synchronization & Cooperation

  Threads within block may synchronize with barriers

    … Step 1 …
    __syncthreads();
    … Step 2 …

 Blocks coordinate via atomic memory operations
   e.g., increment shared queue pointer with atomicInc() (see the sketch below)

  Implicit barrier between dependent kernels

    vec_minus<<<nblocks, blksize>>>(a, b, c);
    vec_dot<<<nblocks, blksize>>>(c, c);
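As an illustration of block-level coordination (a sketch under assumptions, not from the slides): one thread per block claims a slot in a global queue with atomicInc(); publish_block_results, queue, queue_tail, and capacity are hypothetical names.

__global__ void publish_block_results(const int *block_results,
                                      int *queue, unsigned int *queue_tail,
                                      unsigned int capacity)
{
    // Thread 0 of each block claims the next slot; atomicInc returns the old
    // value and wraps the counter back to 0 once it reaches capacity - 1.
    if (threadIdx.x == 0)
    {
        unsigned int slot = atomicInc(queue_tail, capacity - 1);
        queue[slot] = block_results[blockIdx.x];
    }
}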

Page 17: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Using per-block shared memory

 Variables shared across block

    __shared__ int *begin, *end;

 Scratchpad memory

    __shared__ int scratch[blocksize];
    scratch[threadIdx.x] = begin[threadIdx.x];
    // … compute on scratch values …
    begin[threadIdx.x] = scratch[threadIdx.x];

 Communicating values between threads

    scratch[threadIdx.x] = begin[threadIdx.x];
    __syncthreads();
    int left = scratch[threadIdx.x - 1];
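Putting the pattern together, a self-contained sketch (an assumed kernel, not from the slides): each thread stages one element in shared memory, synchronizes, then reads its left neighbour to form an adjacent difference.

#define BLOCKSIZE 256

__global__ void adjacent_diff(const int *in, int *out, int n)
{
    __shared__ int scratch[BLOCKSIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) scratch[threadIdx.x] = in[i];    // stage this block's tile
    __syncthreads();                            // make the stores visible to the whole block

    if (i < n)
    {
        if (threadIdx.x > 0)
            out[i] = scratch[threadIdx.x] - scratch[threadIdx.x - 1];
        else
            out[i] = scratch[threadIdx.x];      // no left neighbour inside this tile
    }
}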

Page 18: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Example: Parallel Reduction

 Summing up a sequence with 1 thread:

    int sum = 0;
    for(int i = 0; i < N; ++i)
        sum += x[i];

 Parallel reduction builds a summation tree
   each thread holds 1 element
   stepwise partial sums
   N threads need log N steps
   one possible approach: Butterfly pattern


Page 20: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Parallel Reduction for 1 Block

// INPUT: Thread i holds value x_i
int i = threadIdx.x;
__shared__ int sum[blocksize];

// One thread per element
sum[i] = x_i;
__syncthreads();

for(int bit = blocksize/2; bit > 0; bit /= 2)
{
    int t = sum[i] + sum[i^bit];
    __syncthreads();
    sum[i] = t;
    __syncthreads();
}
// OUTPUT: Every thread now holds sum in sum[i]

Page 21: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Summing Up CUDA

 CUDA = C + a few simple extensions
   makes it easy to start writing basic parallel programs

  Three key abstractions:
    1. hierarchy of parallel threads
    2. corresponding levels of synchronization
    3. corresponding memory spaces

 Supports massive parallelism of manycore GPUs


Page 23: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

More efficient reduction tree

template<class T>
__device__ T reduce(T *x)
{
    unsigned int i = threadIdx.x;
    unsigned int n = blockDim.x;

    for(unsigned int offset = n/2; offset > 0; offset /= 2)
    {
        if(i < offset)
            x[i] += x[i + offset];
        __syncthreads();
    }

    return x[i];   // Note that only thread 0 has the full sum
}
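A hypothetical wrapper (not on the slide) showing one way reduce() might be used: each block loads a tile into dynamically sized shared memory, reduces it, and thread 0 writes the block's partial sum. It assumes blockDim.x is a power of two and the input length is a multiple of blockDim.x.

__global__ void reduce_tiles(const int *in, int *block_sums)
{
    extern __shared__ int tile[];                  // blockDim.x ints, sized at launch
    unsigned int i = threadIdx.x;

    tile[i] = in[blockIdx.x * blockDim.x + i];     // load this block's tile
    __syncthreads();

    int sum = reduce(tile);                        // tree reduction from the slide above
    if (i == 0) block_sums[blockIdx.x] = sum;      // only thread 0 holds the full tile sum
}

// launch: reduce_tiles<<<n / B, B, B * sizeof(int)>>>(d_in, d_block_sums);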

Page 24: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Reduction tree execution example

Input (shared memory):       10  1  8 -1  0 -2  3  5 -2 -3  2  7  0 11  0  2

active threads 0-7, x[i] += x[i+8]:
                              8 -2 10  6  0  9  3  7 -2 -3  2  7  0 11  0  2

active threads 0-3, x[i] += x[i+4]:
                              8  7 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2

active threads 0-1, x[i] += x[i+2]:
                             21 20 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2

active thread 0, x[i] += x[i+1]:
                             41 20 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2

Final result: 41 (in x[0])

Page 25: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Pattern 1: Fit kernel to the data

 Code lets B-thread block reduce B-element array

  For larger sequences:
   launch N/B blocks to reduce each B-element subsequence
   write N/B partial sums to a temporary array
   repeat until done (see the host-side sketch below)

 We’ll be done in log_B N steps
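A host-side sketch of this pattern (illustrative, not from the slides), reusing the hypothetical reduce_tiles kernel shown earlier and assuming n is a power of the block size B so every tile is full:

int remaining = n;
while (remaining > 1)
{
    int blocks = remaining / B;
    reduce_tiles<<<blocks, B, B * sizeof(int)>>>(d_in, d_partial);  // one partial sum per block

    int *tmp = d_in; d_in = d_partial; d_partial = tmp;             // partial sums feed the next round
    remaining = blocks;
}
// After about log_B n launches, d_in[0] holds the total.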

Page 26: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Pattern 2: Fit kernel to the machine

  For a block size of B:
    1) Launch B blocks to sum N/B elements
    2) Launch 1 block to combine B partial sums

 We’ll be done in 2 steps

 BUT ... need to make sure B blocks fill machine

[Diagram: Stage 1: many blocks each reduce an 8-element subsequence
 (e.g., 3 1 7 0 4 1 6 3 -> partial sums 4 7 5 9 -> 11 14 -> 25);
 Stage 2: 1 block combines the per-block partial sums the same way]

Page 27: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Pattern 3: Atomic accumulation

 Create a global accumulator variable

    __device__ int total;

 At the end of each block, call atomicAdd() to accumulate partial sum into total (sketched below)

 We’ll be done in 1 step

 BUT …
   we’ve created a global serialization point between blocks
   contention may be high, esp. for small block sizes
   atomics support only limited types / operators
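A sketch of this pattern (assumed code, not from the slides): after a block-wide tree reduction, thread 0 of each block folds its partial sum into the global total with an integer atomicAdd(), one of the supported atomic operations. The host is assumed to zero total (e.g., with cudaMemcpyToSymbol) before the launch.

__device__ int total;

__global__ void reduce_atomic(const int *in)
{
    extern __shared__ int tile[];                 // blockDim.x ints
    unsigned int i = threadIdx.x;

    tile[i] = in[blockIdx.x * blockDim.x + i];
    __syncthreads();

    int block_sum = reduce(tile);                 // tree reduction from the earlier slide
    if (i == 0)
        atomicAdd(&total, block_sum);             // one serialized update per block
}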

Page 28: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

THINK DATA-PARALLEL: SCAN

Page 29: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Common Situations in Parallel Computation

 Many parallel threads need to generate a single result value → Reduce

 Many parallel threads that need to partition data → Split

 Many parallel threads and variable output per thread → Compact / Expand / Allocate

Page 30: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

[Figure: split example]
  Flag:    F T F F T F F T
  Payload: 3 6 1 4 0 7 1 3
           after split:
  Flag:    F F F F F T T T
  Payload: 3 1 4 7 1 6 0 3

Split Operation

 Given an array of true and false elements (and payloads)

 Return an array with all true elements at the beginning

 Examples: sorting, building trees

Page 31: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Variable Output Per Thread: Compact

 Remove null elements

    Input:  3 0 7 0 4 1 0 3
    Output: 3 7 4 1 3

Page 32: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Variable Output Per Thread

 Allocate Variable Storage Per Thread

 Examples: marching cubes, geometry generation

[Figure: threads request 2, 1, 0, 3, 2 output slots and write elements
 A B C D E F G H into the allocated ranges]

Page 33: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

“Where do I write my output?”

 In all of these situations, each thread must answer that simple question

 The answer is:

“That depends (on how much the other threads need to write)!”

 “Scan” is an efficient way to answer this question in parallel

Page 34: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Applications of Scan

 Scan is a simple and useful parallel building block for many parallel algorithms:

   radix sort
   quicksort (segmented scan)
   string comparison
   lexical analysis
   stream compaction
   run-length encoding
   polynomial evaluation
   solving recurrences
   tree operations
   histograms
   allocation
   etc.

 Fascinating, since scan is unnecessary in sequential computing!

Page 35: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Parallel Scan (a.k.a. prefix sum)

 Given sequence A of n elements and an associative operator ⊕ with identity I

    Inclusive scan:  scan(A, ⊕)    = [ A_0, A_0 ⊕ A_1, …, A_0 ⊕ ··· ⊕ A_(n-1) ]

    Exclusive scan:  prescan(A, ⊕) = [ I, A_0, …, A_0 ⊕ ··· ⊕ A_(n-2) ]

    Example:  A = [3 1 7 0 4 1 6 3]
              prescan(A, +) = [0 3 4 11 11 15 16 22]
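For reference, the two definitions written out sequentially with + as the operator and 0 as its identity (illustrative code, not from the slides):

void inclusive_scan(const int *A, int *out, int n)
{
    int acc = 0;
    for (int k = 0; k < n; ++k)
    {
        acc += A[k];        // A[0] + … + A[k]
        out[k] = acc;
    }
}

void exclusive_scan(const int *A, int *out, int n)
{
    int acc = 0;            // identity
    for (int k = 0; k < n; ++k)
    {
        out[k] = acc;       // sum of elements strictly before position k
        acc += A[k];
    }
}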

Page 36: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Inclusive Scan across 1 block

// After Hillis & Steele, Data parallel algorithms, CACM 1986.
template<class T>
__device__ T plus_scan(T *x)
{
    unsigned int i = threadIdx.x;
    unsigned int n = blockDim.x;

    for(unsigned int offset = 1; offset < n; offset *= 2)
    {
        T t;
        if(i >= offset) t = x[i - offset];
        __syncthreads();
        if(i >= offset) x[i] = t + x[i];
        __syncthreads();
    }
    return x[i];   // inclusive prefix sum at position i
}
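One way this scan answers “where do I write my output?” (a sketch under assumptions, not from the slide): single-block stream compaction. Each thread scans a 0/1 keep-flag with plus_scan(); the inclusive sum minus one is the output slot of each kept element. compact_block and out_count are hypothetical names.

__global__ void compact_block(const int *in, int *out, int *out_count, int n)
{
    extern __shared__ int flags[];               // blockDim.x ints, sized at launch
    unsigned int i = threadIdx.x;

    int keep = (i < n && in[i] != 0);            // keep the non-null elements
    flags[i] = keep;
    __syncthreads();

    int pos = plus_scan(flags);                  // inclusive prefix sum of the flags
    if (keep) out[pos - 1] = in[i];              // scatter to the compacted position
    if (i == blockDim.x - 1) *out_count = pos;   // last thread sees the total count
}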

Page 37: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Combining results across blocks

 Much like reduction case
   Pattern 1: Fit kernel to the data … O(log_B N) steps
   Pattern 2: Fit kernel to the machine … O(1) steps

 Now need to add partial sums back to partial results
   kernels need to save partial results
   need additional kernel to add partial sums back in

 Sketch of algorithm using pattern 2 (launch sketch below):
   1) B blocks compute partial results for subsequences
   2) 1 block performs scan of partial sums from (1)
   3) N/B blocks add results from (2) into results from (1)
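A host-side view of those three launches (kernel names are hypothetical, not from the slides; B is the block size, d_data holds the N inputs, d_block_sums one partial sum per subsequence):

scan_subsequences<<<B, B, B * sizeof(int)>>>(d_data, d_block_sums, N);   // (1) per-subsequence scans, totals saved
scan_block_sums<<<1, B, B * sizeof(int)>>>(d_block_sums);                // (2) one block scans the partial sums
add_block_offsets<<<N/B, B>>>(d_data, d_block_sums, N);                  // (3) add each subsequence's offset back in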

Page 38: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Directions for improvement

 Templating for generic operators
   scan<plus>() much more powerful than plus_scan()

 Pay attention to operator properties
   associative? commutative? has inverse? has identity?
   the fewer that are true, the more careful the code must be
   e.g., reduction examples assume commutative with identity

 Many opportunities for code optimization
   remove inefficiencies
   (auto-) tuning for different cases

Page 39: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Directions for optimization

 Loop unrolling with fixed block sizes
   almost always a good idea

 Minimize unutilized threads
   reduction example leaves half its threads idle

 Improve work efficiency
   scan example does O(n log n) work

 Watch out for divergence and incoherent load/store
   sometimes at odds with earlier directions

Page 40: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

GPU ARCHITECTURE

Page 41: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

GPU ARCHITECTURE

[Diagram: GeForce GTX 280 / Tesla T10 overview — 240 scalar cores, on-chip
memory, texture units, fixed-function acceleration, communication fabric,
memory & I/O]

Page 42: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

[Diagram: one streaming multiprocessor (SM) — MT issue unit, instruction
cache, constant cache, 8 SPs, 2 SFUs, DP unit, shared memory]

Key Architectural Ideas

 Massive multithreading
   hide latency with computation, not cache

 SIMT execution (Single Instruction Multiple Thread)
   Warps – threads run in groups of 32
   HW handles divergence gracefully

Page 43: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Reduction tree redux

Input (shared memory):       10  1  8 -1  0 -2  3  5 -2 -3  2  7  0 11  0  2

active threads 0-7, x[i] += x[i+8]:
                              8 -2 10  6  0  9  3  7 -2 -3  2  7  0 11  0  2

active threads 0-3, x[i] += x[i+4]:
                              8  7 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2

active threads 0-1, x[i] += x[i+2]:
                             21 20 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2

active thread 0, x[i] += x[i+1]:
                             41 20 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2

Final result: 41 (in x[0])

Page 44: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Compare to interleaved addressing:

Input (shared memory):       10  1  8 -1  0 -2  3  5 -2 -3  2  7  0 11  0  2

8 active threads, x[i] += x[i+1] for even i:
                             11  1  7 -1 -2 -2  8  5 -5 -3  9  7 11 11  2  2

4 active threads, x[i] += x[i+2] for i divisible by 4:
                             18  1  7 -1  6 -2  8  5  4 -3  9  7 13 11  2  2

2 active threads, x[i] += x[i+4] for i divisible by 8:
                             24  1  7 -1  6 -2  8  5 17 -3  9  7 13 11  2  2

1 active thread, x[0] += x[8]:
                             41  1  7 -1  6 -2  8  5 17 -3  9  7 13 11  2  2

Final result: 41 (in x[0])

Page 45: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

SOME FINAL THOUGHTS

Faster is not “just faster” -- David Kirk, Chief Scientist

  2-3X faster is “just faster”
   Do a little more, wait a little less
   Doesn’t change how you work

  5-10x faster is “significant”
   Worth upgrading
   Worth re-writing (parts of) the application

  100x (or more!) faster is “fundamentally different”
   Worth considering a new platform
   Worth re-architecting the application
   Makes new applications possible
   Drives “time to discovery”, changes science

Page 46: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

SOME FINAL THOUGHTS

  We should teach parallel computing in CS 1 or CS 2

  Remember: computers don’t get faster, just wider

  Heapsort and mergesort

  Both O(n lg n)

  One parallel-friendly, one not

  Students need to understand this early

Page 47: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Conclusion

 GPUs are massively parallel manycore computers
   Ubiquitous - most successful parallel processor in history
   Useful - users achieve huge speedups on real problems

 CUDA is a powerful parallel programming model
   Heterogeneous - mixed serial-parallel programming
   Scalable - hierarchical thread execution model
   Accessible - minimal but expressive changes to C

  They provide tremendous scope for innovative research

Page 48: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Questions?

Page 49: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Copyright NVIDIA 2008 49

Questions & Discussion

  Lots more slides…
   CUDA programming
   Advanced examples: sparse matrix-vector, ray tracing
   Performance: tools and applications
     BLAS, FFT, sort, sparse matrix-vector
     fluid dynamics, molecular dynamics, n-body, discontinuous Galerkin
   GPU architecture
   Myths of GPU Computing

  Otherwise: [email protected]

Page 50: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Example: Sparse Matrix-Vector

Page 51: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Sparse matrix-vector multiplication

 Sparse matrices have relatively few non-zero entries
   frequently O(n) rather than O(n²)
   only store & operate on these non-zero entries

Example: Compressed Sparse Row (CSR) format for the 4x4 matrix

    [ 3 0 1 0 ]
    [ 0 0 0 0 ]
    [ 0 2 4 1 ]
    [ 1 0 0 1 ]

Av[7] = { 3, 1, 2, 4, 1, 1, 1 };   // non-zero values
Aj[7] = { 0, 2, 1, 2, 3, 0, 3 };   // column indices
Ap[5] = { 0, 2, 2, 5, 7 };         // row pointers (spans of row 0, row 2, row 3)

Page 52: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Sparse matrix-vector multiplication

float multiply_row(uint rowsize,  // number of non-zeros in row
                   uint *Aj,      // column indices for row
                   float *Av,     // non-zero entries for row
                   float *x)      // the RHS vector
{
    float sum = 0;

    for(uint column = 0; column < rowsize; ++column)
        sum += Av[column] * x[Aj[column]];

    return sum;
}

Av[7] = { 3, 1, 2, 4, 1, 1, 1 };   // non-zero values
Aj[7] = { 0, 2, 1, 2, 3, 0, 3 };   // column indices
Ap[5] = { 0, 2, 2, 5, 7 };         // row pointers

Page 53: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Sparse matrix-vector multiplication

float multiply_row(uint size, uint *Aj, float *Av, float *x);

void csrmul_serial(uint *Ap, uint *Aj, float *Av,
                   uint num_rows, float *x, float *y)
{
    for(uint row = 0; row < num_rows; ++row)
    {
        uint row_begin = Ap[row];
        uint row_end   = Ap[row+1];

        y[row] = multiply_row(row_end - row_begin,
                              Aj + row_begin, Av + row_begin, x);
    }
}

Page 54: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Sparse matrix-vector multiplication

float multiply_row(uint size, uint *Aj, float *Av, float *x);

__global__ void csrmul_kernel(uint *Ap, uint *Aj, float *Av,
                              uint num_rows, float *x, float *y)
{
    uint row = blockIdx.x*blockDim.x + threadIdx.x;

    if( row < num_rows )
    {
        uint row_begin = Ap[row];
        uint row_end   = Ap[row+1];

        y[row] = multiply_row(row_end - row_begin,
                              Aj + row_begin, Av + row_begin, x);
    }
}

Page 55: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Compare with OpenMP

OpenMP Kernel:

void csrmul_openmp(… … … … … …)
{
    #pragma omp parallel for
    for(uint row = 0; row < num_rows; ++row)
    {
        uint row_begin = Ap[row];
        uint row_end   = Ap[row+1];

        y[row] = multiply_row(… … … …);
    }
}

CUDA Kernel:

__global__ void csrmul_kernel(… … … … … …)
{
    uint row = blockIdx.x*blockDim.x + threadIdx.x;

    if( row < num_rows )
    {
        uint row_begin = Ap[row];
        uint row_end   = Ap[row+1];

        y[row] = multiply_row(… … … …);
    }
}

Page 56: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Adding a simple caching scheme

__global__ void csrmul_cached(… … … … … …)
{
    uint begin = blockIdx.x*blockDim.x, end = begin + blockDim.x;
    uint row   = begin + threadIdx.x;

    __shared__ float cache[blocksize];        // array to cache rows
    if( row < num_rows )
        cache[threadIdx.x] = x[row];          // fetch to cache
    __syncthreads();

    if( row < num_rows )
    {
        uint row_begin = Ap[row], row_end = Ap[row+1];
        float sum = 0;

        // body of multiply_row(), inlined so it can read the cache
        for(uint col = row_begin; col < row_end; ++col)
        {
            uint j = Aj[col];

            // Fetch from cached rows when possible
            float x_j = (j >= begin && j < end) ? cache[j - begin] : x[j];

            sum += Av[col] * x_j;
        }

        y[row] = sum;
    }
}

Page 57: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Example: Scan

Page 58: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Example: Multigrid

Poisson Solver

Page 59: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Results: Multigrid Poisson Solver

do k = 1, nz
  do j = 1, ny
    do i = 1+mod(k+j+redblack,2), nx, 2
      residual = -hsq*B(i,j,k) - U(i,j,k)*6 + &
                 (U(i+1,j,k) + U(i-1,j,k)) + &
                 (U(i,j+1,k) + U(i,j-1,k)) + &
                 (U(i,j,k-1) + U(i,j,k+1))
      U(i,j,k) = U(i,j,k) + (omega/6)*residual
    end do
  end do
end do

Gauss-Seidel Red-Black smoother main loop - FORTRAN

Page 60: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Results: Multigrid Poisson Solver

__global__ void Relax( Grid U, Grid B, float omega, int red_black )
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    int k = blockIdx.z*blockDim.z + threadIdx.z;

    if ((j+k+i) % 2 != red_black) return;

    int idx = i + j*U.jstride + k*U.kstride;

    float residual = -U.hsq*B.buf[idx] - 6.0f*U.buf[idx]
                   + U.buf[idx + U.istride] + U.buf[idx - U.istride]
                   + U.buf[idx + U.jstride] + U.buf[idx - U.jstride]
                   + U.buf[idx + 1        ] + U.buf[idx - 1        ];

    U.buf[idx] += omega * residual / 6.0f;
}

Gauss-Seidel Red-Black smoother kernel - CUDA

Page 61: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Results: Multigrid Poisson Solver

 Optimal version is slightly more complex
   fast math intrinsics, optimized memory layout, etc.

 Single precision, 1 step of full multigrid
 GTX 280 vs. 3.16 GHz Core2 Duo E8500 (using 1 core)

    Grid size          GTX 280     Core2 E8500
    32 x 32 x 32       0.01 sec    0.02 sec
    64 x 64 x 64       0.02 sec    0.14 sec
    128 x 128 x 128    0.06 sec    1.16 sec
    256 x 256 x 256    0.33 sec    9.27 sec

Page 62: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Example: Vector Add w/ Host Code

Page 63: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Example: Vector Addition Kernel

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Run N/256 blocks of 256 threads each
    vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C);
}

Device Code

Page 64: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Example: Vector Addition Kernel

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Run N/256 blocks of 256 threads each
    vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C);
}

Host Code

Page 65: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Example: Host code for vecAdd

// allocate and initialize host (CPU) memory
float *h_A = …, *h_B = …;

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

// execute the kernel on N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
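The slide stops at the launch; a sketch of the remaining steps would copy the result back and release device memory (h_C is an assumed host array for the result):

// copy result from device to host
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );

// free device memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);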

Page 66: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Example: Ray Tracing

Page 67: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Copyright NVIDIA 2008 67

Implications & Strategy

  GTX 280 supports up to 30,720 concurrent threads!

  Implication: minimize state per thread

  Strategy: replace traversal stack with short stack

Horn et al., Interactive k-D Tree GPU Raytracing, I3D 2008

The following slides are borrowed from Daniel Horn’s presentation

Page 68: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Copyright NVIDIA 2008 68

KD-Tree

[Figure: a k-d tree over a scene — splitting planes X, Y, Z partition the
space into leaf cells A, B, C, D; the ray enters the scene at tmin and exits
at tmax]

Page 69: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Copyright NVIDIA 2008 69

KD-Tree Traversal

[Figure: traversing the k-d tree for a ray — at each inner node the ray
visits the near child first and pushes the far child onto a stack for later
(here the stack holds Z, A)]

Page 70: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Copyright NVIDIA 2008 70

KD-Restart

[Figure: the same k-d tree scene with cells A-D and planes X, Y, Z]

  Standard traversal
   omit stack operations
   proceed to 1st leaf

  If no intersection
   advance (tmin, tmax)
   restart from root
   proceed to next leaf

Page 71: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Copyright NVIDIA 2008 71

KD-Restart with short stack (size 1)

[Figure: the same traversal, but a 1-entry stack caches the most recently
skipped far child (e.g., Z, then A), so a restart can resume from it instead
of always returning to the root]

Page 72: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Copyright NVIDIA 2008 72

Short Stack Cache

  Even better:
   each thread stores its full stack in memory (non-blocking writes)
   cache the top of the stack locally (registers or shared memory)

  Enables BVHs as well as k-d trees

  5-10% faster in our current implementation

Page 73: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Products

Page 74: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Tesla S1070 1U System

4 Teraflops (single precision)
800 watts (typical power)

Page 75: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Tesla C1060 Board

957 Gigaflops (single precision)
160 watts (typical power)

Page 76: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Building a 100TF datacenter

                       CPU 1U Server              Tesla 1U System
  Per server:          4 CPU cores                4 GPUs: 960 cores
                       0.07 Teraflop              4 Teraflops
                       $ 2000                     $ 8000
                       400 W                      800 W

  For ~100 TF:         1429 CPU servers           25 CPU servers + 25 Tesla systems
                       $ 3.1 M                    $ 0.31 M
                       571 kW                     27 kW

                                                  10x lower cost, 21x lower power

Page 77: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Tesla Personal Supercomputer

Supercomputing Performance
  Massively parallel CUDA architecture
  960 cores, 4 TeraFlops
  250x the performance of a desktop

Personal
  One researcher, one supercomputer
  Plugs into standard power strip

Accessible
  Program in C for Windows, Linux
  Available now worldwide under $10,000

Page 78: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

CUDA SDK

[Diagram: CUDA SDK stack — integrated CPU and GPU C source code is compiled
by the NVIDIA C compiler into NVIDIA assembly for computing (GPU) and CPU
host code; libraries (FFT, BLAS, …), example source code, the CUDA driver,
debugger, and profiler sit alongside the standard C compiler targeting the CPU]

Page 79: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Applications

Page 80: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Example: Single Precision BLAS

[Chart: BLAS (SGEMM) on CUDA — GFLOPS (0-350) vs. matrix size for CUDA
(CUBLAS) vs. ATLAS 1 Thread and ATLAS 4 Threads]

CUBLAS: CUDA 2.0b2, Tesla C1060 (10-series GPU); ATLAS 3.81 on dual 2.8 GHz Opteron dual-core

Page 81: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Example: Double Precision BLAS

[Chart: BLAS (DGEMM) on CUDA — GFLOPS (0-70) vs. matrix size for CUBLAS vs.
ATLAS Parallel and ATLAS Single]

CUBLAS: CUDA 2.0b2 on Tesla C1060 (10-series); ATLAS 3.81 on Intel Xeon E5440 quad-core, 2.83 GHz

Page 82: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

DGEMM Performance

GPU performance includes data copies over PCIe gen-2

Page 83: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Direct Coulomb Summation (VMD)

[Chart: kernel performance vs. CPU]
  CUDA-Simple: 14.8x faster than CPU, 33% of the fastest GPU kernel
  CUDA-Unroll8clx: fastest GPU kernel, 44x faster than CPU, 291 GFLOPS on GeForce 8800 GTX

GPU computing. J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, J. Phillips. Proceedings of the IEEE, 2008. In press.

http://www.ks.uiuc.edu/Research/namd/

Page 84: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Fluid Dynamics

  2-D compressible Euler equations
   irregular structured grid
   finite volume method (Ni, 1981)

E. H. Phillips, Y. Zhang, R. L. Davis, J. D. Owens. A multi-grid solver for the 2D compressible Euler equations on a GPU cluster. http://www.ece.ucdavis.edu/cerl/techreports/2008-2/

Page 85: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Solving Euler Equations

 Divide domain into tiles & distribute amongst blocks

 As always, try for:
   maximal work per tile
   minimal data exchange between tiles
   load balance within/between tiles

Page 86: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Solving Euler Equations

Parallel speedup vs. single CPU core, by domain size (nodes):

                        1.6K   25K   400K   1.6M   6.4M
  QuadroFX 5600            4    18     22     22     22
  2x QuadroFX 5600         4    22     39     42     43
  4x QuadroFX 5600         2    22     69     80     85
  8x QuadroFX 5600         1    14    110    136    160

Page 87: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Discontinuous Galerkin Methods

 Weak coupling between elements
   limited intra-element communication
   allows non-conforming meshes
   allows heterogeneous element types

 Support local higher-order operations
   increased arithmetic per element
   but still local to the element

Page 88: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

CPU code achieves ~3 GFLOPS/core

Full time-dependent electromagnetic scattering in an open volume surrounding an aircraft

from T. Warburton, A. Klockner, J. Hesthaven

Page 89: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Multi-GPU Parallel Scaling

We see just over 700 GFLOPS on 3x GTX 280

from T. Warburton, A. Klockner, J. Hesthaven

Page 90: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Sparse Linear Algebra Results

Page 91: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Sparse Matrix-Vector Multiplication (SpMV) on CUDA

 Experimented with several data structures
   CSR: Compressed Sparse Row
   HYB: hybrid of ELLPACK (ELL) and Coordinate (COO) formats

 HYB gave best results
   speed of ELL with the flexibility of COO

 Benchmarked against matrices from “Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms”, S. Williams et al., Supercomputing 2007

Page 92: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Results: Sparse Matrix-Vector Multiplication (SpMV) on CUDA

CPU Results from “Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Williams et al, Supercomputing 2007

Page 93: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Double Precision

Page 94: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

T10 Double Precision Floating Point

  Precision:                            IEEE 754
  Rounding modes for FADD and FMUL:     all 4 IEEE (round to nearest, zero, +inf, -inf)
  Denormal handling:                    full speed
  NaN support:                          yes
  Overflow and infinity support:        yes
  Flags:                                no
  FMA:                                  yes
  Square root:                          software with low-latency FMA-based convergence
  Division:                             software with low-latency FMA-based convergence
  Reciprocal estimate accuracy:         24 bit
  Reciprocal sqrt estimate accuracy:    23 bit
  log2(x) and 2^x estimates accuracy:   23 bit

Page 95: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Double Precision Floating Point

                                     NVIDIA Tesla T10          SSE2                        Cell SPE
  Precision                          IEEE 754                  IEEE 754                    IEEE 754
  Rounding modes (FADD, FMUL)        all 4 IEEE                all 4 IEEE                  all 4 IEEE
  Denormal handling                  full speed                supported, costs            supported only for results, not input
                                                               1000’s of cycles            operands (input denormals flushed to 0)
  NaN support                        yes                       yes                         yes
  Overflow and infinity support      yes                       yes                         yes
  Flags                              no                        yes                         yes
  FMA                                yes                       no                          yes
  Square root                        software (FMA-based)      hardware                    software only
  Division                           software (FMA-based)      hardware                    software only
  Reciprocal estimate accuracy       24 bit                    12 bit                      12 bit + step
  Reciprocal sqrt est. accuracy      23 bit                    12 bit                      12 bit + step
  log2(x) and 2^x est. accuracy      23 bit                    no                          no

Page 96: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Single Precision Floating Point

                                     G80                       SSE                  IBM AltiVec            Cell SPE
  Precision                          IEEE 754                  IEEE 754             IEEE 754               IEEE 754
  Rounding modes (FADD, FMUL)        round to nearest and      all 4 IEEE           round to nearest       round to zero /
                                     round to zero                                  only                   truncate only
  Denormal handling                  flush to zero             supported,           supported,             flush to zero
                                                               1000’s of cycles     1000’s of cycles
  NaN support                        yes                       yes                  yes                    no
  Overflow and infinity support      yes, only clamps          yes                  yes                    no, infinity
                                     to max norm
  Flags                              no                        yes                  yes                    some
  Square root                        software only             hardware             software only          software only
  Division                           software only             hardware             software only          software only
  Reciprocal estimate accuracy       24 bit                    12 bit               12 bit                 12 bit
  Reciprocal sqrt est. accuracy      23 bit                    12 bit               12 bit                 12 bit
  log2(x) and 2^x est. accuracy      23 bit                    no                   12 bit                 no

Page 97: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Quotes

Page 98: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

GPUs have evolved to the point where many real world applications are easily implemented on them and run significantly faster than on multi-core systems. Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs.

Jack Dongarra Professor, University of Tennessee Author of Linpack

Page 99: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

We’ve all heard ‘desktop supercomputer’ claims in the past, but this time it’s for real: NVIDIA and its partners will be delivering outstanding performance and broad applicability to the mainstream marketplace.

Heterogeneous computing is what makes such a breakthrough possible.

Burton Smith Technical Fellow, Microsoft Formerly, Chief Scientist at Cray