IAP09 CUDA@MIT/6.963 David Luebke NVIDIA Research

IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

DESCRIPTION

See http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009


Page 1: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

IAP09 CUDA@MIT/6.963

David Luebke NVIDIA Research

Page 2: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

The “New” Moore’s Law

 Computers no longer get faster, just wider

 You must re-think your algorithms to be parallel!

 Data-parallel computing is most scalable solution

Page 3: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Enter the GPU

 Massive economies of scale

 Massively parallel

Page 4: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Enter CUDA

 Scalable parallel programming model

 Minimal extensions to familiar C/C++ environment

 Heterogeneous serial-parallel computing

Page 5: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Sound Bite

GPUs + CUDA =

The Democratization of Parallel Computing

Massively parallel computing has become a commodity technology

Page 6: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

MOTIVATION

Page 7: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

MOTIVATION

Measured application speedups vs. CPU: 146X, 36X, 19X, 17X, 100X, 149X, 47X, 20X, 24X, 30X

Page 8: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

CUDA: ‘C’ FOR PARALLELISM

Standard C Code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C Code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
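For orientation, a minimal host-side driver for the parallel version might look like the following sketch (not part of the original slide; h_x and h_y are assumed host arrays already holding n floats):

float *d_x, *d_y;
cudaMalloc((void**)&d_x, n * sizeof(float));
cudaMalloc((void**)&d_y, n * sizeof(float));
cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, d_x, d_y);   // same launch as above

cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);  // copy result back
cudaFree(d_x);
cudaFree(d_y);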

Page 9: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Hierarchy of concurrent threads

 Parallel kernels composed of many threads
   all threads execute the same sequential program

 Threads are grouped into thread blocks
   threads in the same block can cooperate

 Threads/blocks have unique IDs

[Diagram: kernel foo() launches many blocks; each block b contains threads t0 t1 … tB]

Page 10: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Hierarchical organization

[Diagram: memory hierarchy]
  Thread: per-thread local memory
  Block: per-block shared memory
  Kernel 0, Kernel 1, …: per-device global memory
  Local barrier within a block; global barrier between dependent kernels

Page 11: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Heterogeneous Programming

 CUDA = serial program with parallel kernels, all in C
   Serial C code executes in a CPU thread
   Parallel kernel C code executes in thread blocks across multiple processing elements

[Diagram: execution alternates between serial and parallel phases]
  Serial Code
  Parallel Kernel: foo<<< nBlk, nTid >>>(args);
  Serial Code
  Parallel Kernel: bar<<< nBlk, nTid >>>(args);

Page 12: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Thread = virtualized scalar processor

  Independent thread of execution
   has its own PC, variables (registers), processor state, etc.
   no implication about how threads are scheduled

 CUDA threads might be physical threads
   as on NVIDIA GPUs

 CUDA threads might be virtual threads
   might pick 1 block = 1 physical thread on a multicore CPU

Page 13: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Block = virtualized multiprocessor

 Provides programmer flexibility
   freely choose processors to fit data
   freely customize for each kernel launch

  Thread block = a (data) parallel task
   all blocks in kernel have the same entry point
   but may execute any code they want

  Thread blocks of kernel must be independent tasks
   program valid for any interleaving of block executions

Page 14: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Blocks must be independent

 Any possible interleaving of blocks should be valid
   presumed to run to completion without pre-emption
   can run in any order
   can run concurrently OR sequentially

 Blocks may coordinate but not synchronize
   shared queue pointer: OK
   shared lock: BAD … can easily deadlock

  Independence requirement gives scalability

Page 15: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Scalable Execution Model

Kernel launched by host

[Diagram: the kernel's blocks are distributed across the GPU's multiprocessors;
each multiprocessor contains a multithreaded instruction unit (MT IU), scalar
processors (SP), and shared memory; all multiprocessors access device memory]

Blocks Run on Multiprocessors

Page 16: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Synchronization & Cooperation

  Threads within block may synchronize with barriers

    … Step 1 …
    __syncthreads();
    … Step 2 …

 Blocks coordinate via atomic memory operations
   e.g., increment shared queue pointer with atomicInc() (see the sketch below)

  Implicit barrier between dependent kernels

    vec_minus<<<nblocks, blksize>>>(a, b, c);
    vec_dot<<<nblocks, blksize>>>(c, c);
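As an illustration of block-level coordination (a sketch under assumptions, not from the slides): one thread per block claims a slot in a global queue with atomicInc(); publish_block_results, queue, queue_tail, and capacity are hypothetical names.

__global__ void publish_block_results(const int *block_results,
                                      int *queue, unsigned int *queue_tail,
                                      unsigned int capacity)
{
    // Thread 0 of each block claims the next slot; atomicInc returns the old
    // value and wraps the counter back to 0 once it reaches capacity - 1.
    if (threadIdx.x == 0)
    {
        unsigned int slot = atomicInc(queue_tail, capacity - 1);
        queue[slot] = block_results[blockIdx.x];
    }
}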

Page 17: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Using per-block shared memory

 Variables shared across block

    __shared__ int *begin, *end;

 Scratchpad memory

    __shared__ int scratch[blocksize];
    scratch[threadIdx.x] = begin[threadIdx.x];
    // … compute on scratch values …
    begin[threadIdx.x] = scratch[threadIdx.x];

 Communicating values between threads

    scratch[threadIdx.x] = begin[threadIdx.x];
    __syncthreads();
    int left = scratch[threadIdx.x - 1];
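Putting the pattern together, a self-contained sketch (an assumed kernel, not from the slides): each thread stages one element in shared memory, synchronizes, then reads its left neighbour to form an adjacent difference.

#define BLOCKSIZE 256

__global__ void adjacent_diff(const int *in, int *out, int n)
{
    __shared__ int scratch[BLOCKSIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) scratch[threadIdx.x] = in[i];    // stage this block's tile
    __syncthreads();                            // make the stores visible to the whole block

    if (i < n)
    {
        if (threadIdx.x > 0)
            out[i] = scratch[threadIdx.x] - scratch[threadIdx.x - 1];
        else
            out[i] = scratch[threadIdx.x];      // no left neighbour inside this tile
    }
}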

Page 18: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Example: Parallel Reduction

 Summing up a sequence with 1 thread:

    int sum = 0;
    for(int i = 0; i < N; ++i)
        sum += x[i];

 Parallel reduction builds a summation tree
   each thread holds 1 element
   stepwise partial sums
   N threads need log N steps
   one possible approach: Butterfly pattern


Page 20: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Parallel Reduction for 1 Block

// INPUT: Thread i holds value x_i
int i = threadIdx.x;
__shared__ int sum[blocksize];

// One thread per element
sum[i] = x_i;
__syncthreads();

for(int bit = blocksize/2; bit > 0; bit /= 2)
{
    int t = sum[i] + sum[i^bit];
    __syncthreads();
    sum[i] = t;
    __syncthreads();
}
// OUTPUT: Every thread now holds sum in sum[i]

Page 21: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Summing Up CUDA

 CUDA = C + a few simple extensions
   makes it easy to start writing basic parallel programs

  Three key abstractions:
    1. hierarchy of parallel threads
    2. corresponding levels of synchronization
    3. corresponding memory spaces

 Supports massive parallelism of manycore GPUs


Page 23: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

More efficient reduction tree

template<class T>
__device__ T reduce(T *x)
{
    unsigned int i = threadIdx.x;
    unsigned int n = blockDim.x;

    for(unsigned int offset = n/2; offset > 0; offset /= 2)
    {
        if(i < offset)
            x[i] += x[i + offset];
        __syncthreads();
    }

    return x[i];   // Note that only thread 0 has the full sum
}
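A hypothetical wrapper (not on the slide) showing one way reduce() might be used: each block loads a tile into dynamically sized shared memory, reduces it, and thread 0 writes the block's partial sum. It assumes blockDim.x is a power of two and the input length is a multiple of blockDim.x.

__global__ void reduce_tiles(const int *in, int *block_sums)
{
    extern __shared__ int tile[];                  // blockDim.x ints, sized at launch
    unsigned int i = threadIdx.x;

    tile[i] = in[blockIdx.x * blockDim.x + i];     // load this block's tile
    __syncthreads();

    int sum = reduce(tile);                        // tree reduction from the slide above
    if (i == 0) block_sums[blockIdx.x] = sum;      // only thread 0 holds the full tile sum
}

// launch: reduce_tiles<<<n / B, B, B * sizeof(int)>>>(d_in, d_block_sums);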

Page 24: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Reduction tree execution example

Input (shared memory):       10  1  8 -1  0 -2  3  5 -2 -3  2  7  0 11  0  2

active threads 0-7, x[i] += x[i+8]:
                              8 -2 10  6  0  9  3  7 -2 -3  2  7  0 11  0  2

active threads 0-3, x[i] += x[i+4]:
                              8  7 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2

active threads 0-1, x[i] += x[i+2]:
                             21 20 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2

active thread 0, x[i] += x[i+1]:
                             41 20 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2

Final result: 41 (in x[0])

Page 25: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Pattern 1: Fit kernel to the data

 Code lets B-thread block reduce B-element array

  For larger sequences:
   launch N/B blocks to reduce each B-element subsequence
   write N/B partial sums to a temporary array
   repeat until done (see the host-side sketch below)

 We’ll be done in log_B N steps
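A host-side sketch of this pattern (illustrative, not from the slides), reusing the hypothetical reduce_tiles kernel shown earlier and assuming n is a power of the block size B so every tile is full:

int remaining = n;
while (remaining > 1)
{
    int blocks = remaining / B;
    reduce_tiles<<<blocks, B, B * sizeof(int)>>>(d_in, d_partial);  // one partial sum per block

    int *tmp = d_in; d_in = d_partial; d_partial = tmp;             // partial sums feed the next round
    remaining = blocks;
}
// After about log_B n launches, d_in[0] holds the total.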

Page 26: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Pattern 2: Fit kernel to the machine

  For a block size of B:
    1) Launch B blocks to sum N/B elements
    2) Launch 1 block to combine B partial sums

 We’ll be done in 2 steps

 BUT ... need to make sure B blocks fill machine

[Diagram: Stage 1: many blocks each reduce an 8-element subsequence
 (e.g., 3 1 7 0 4 1 6 3 -> partial sums 4 7 5 9 -> 11 14 -> 25);
 Stage 2: 1 block combines the per-block partial sums the same way]

Page 27: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Pattern 3: Atomic accumulation

 Create a global accumulator variable

    __device__ int total;

 At the end of each block, call atomicAdd() to accumulate partial sum into total (sketched below)

 We’ll be done in 1 step

 BUT …
   we’ve created a global serialization point between blocks
   contention may be high, esp. for small block sizes
   atomics support only limited types / operators
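A sketch of this pattern (assumed code, not from the slides): after a block-wide tree reduction, thread 0 of each block folds its partial sum into the global total with an integer atomicAdd(), one of the supported atomic operations. The host is assumed to zero total (e.g., with cudaMemcpyToSymbol) before the launch.

__device__ int total;

__global__ void reduce_atomic(const int *in)
{
    extern __shared__ int tile[];                 // blockDim.x ints
    unsigned int i = threadIdx.x;

    tile[i] = in[blockIdx.x * blockDim.x + i];
    __syncthreads();

    int block_sum = reduce(tile);                 // tree reduction from the earlier slide
    if (i == 0)
        atomicAdd(&total, block_sum);             // one serialized update per block
}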

Page 28: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

THINK DATA-PARALLEL: SCAN

Page 29: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Common Situations in Parallel Computation

 Many parallel threads need to generate a single result value → Reduce

 Many parallel threads that need to partition data → Split

 Many parallel threads and variable output per thread → Compact / Expand / Allocate

Page 30: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

[Figure: split example]
  Flag:    F T F F T F F T
  Payload: 3 6 1 4 0 7 1 3
           after split:
  Flag:    F F F F F T T T
  Payload: 3 1 4 7 1 6 0 3

Split Operation

 Given an array of true and false elements (and payloads)

 Return an array with all true elements at the beginning

 Examples: sorting, building trees

Page 31: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Variable Output Per Thread: Compact

 Remove null elements

    Input:  3 0 7 0 4 1 0 3
    Output: 3 7 4 1 3

Page 32: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Variable Output Per Thread

 Allocate Variable Storage Per Thread

 Examples: marching cubes, geometry generation

[Figure: threads request 2, 1, 0, 3, 2 output slots and write elements
 A B C D E F G H into the allocated ranges]

Page 33: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

“Where do I write my output?”

 In all of these situations, each thread must answer that simple question

 The answer is:

“That depends (on how much the other threads need to write)!”

 “Scan” is an efficient way to answer this question in parallel

Page 34: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Applications of Scan

 Scan is a simple and useful parallel building block for many parallel algorithms:

   radix sort
   quicksort (segmented scan)
   string comparison
   lexical analysis
   stream compaction
   run-length encoding
   polynomial evaluation
   solving recurrences
   tree operations
   histograms
   allocation
   etc.

 Fascinating, since scan is unnecessary in sequential computing!

Page 35: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Parallel Scan (a.k.a. prefix sum)

 Given sequence A of n elements and an associative operator ⊕ with identity I

    Inclusive scan:  scan(A, ⊕)    = [ A_0, A_0 ⊕ A_1, …, A_0 ⊕ ··· ⊕ A_(n-1) ]

    Exclusive scan:  prescan(A, ⊕) = [ I, A_0, …, A_0 ⊕ ··· ⊕ A_(n-2) ]

    Example:  A = [3 1 7 0 4 1 6 3]
              prescan(A, +) = [0 3 4 11 11 15 16 22]
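For reference, the two definitions written out sequentially with + as the operator and 0 as its identity (illustrative code, not from the slides):

void inclusive_scan(const int *A, int *out, int n)
{
    int acc = 0;
    for (int k = 0; k < n; ++k)
    {
        acc += A[k];        // A[0] + … + A[k]
        out[k] = acc;
    }
}

void exclusive_scan(const int *A, int *out, int n)
{
    int acc = 0;            // identity
    for (int k = 0; k < n; ++k)
    {
        out[k] = acc;       // sum of elements strictly before position k
        acc += A[k];
    }
}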

Page 36: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Inclusive Scan across 1 block

// After Hillis & Steele, Data parallel algorithms, CACM 1986.
template<class T>
__device__ T plus_scan(T *x)
{
    unsigned int i = threadIdx.x;
    unsigned int n = blockDim.x;

    for(unsigned int offset = 1; offset < n; offset *= 2)
    {
        T t;
        if(i >= offset) t = x[i - offset];
        __syncthreads();
        if(i >= offset) x[i] = t + x[i];
        __syncthreads();
    }
    return x[i];   // inclusive prefix sum at position i
}
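One way this scan answers “where do I write my output?” (a sketch under assumptions, not from the slide): single-block stream compaction. Each thread scans a 0/1 keep-flag with plus_scan(); the inclusive sum minus one is the output slot of each kept element. compact_block and out_count are hypothetical names.

__global__ void compact_block(const int *in, int *out, int *out_count, int n)
{
    extern __shared__ int flags[];               // blockDim.x ints, sized at launch
    unsigned int i = threadIdx.x;

    int keep = (i < n && in[i] != 0);            // keep the non-null elements
    flags[i] = keep;
    __syncthreads();

    int pos = plus_scan(flags);                  // inclusive prefix sum of the flags
    if (keep) out[pos - 1] = in[i];              // scatter to the compacted position
    if (i == blockDim.x - 1) *out_count = pos;   // last thread sees the total count
}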

Page 37: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Combining results across blocks

 Much like reduction case
   Pattern 1: Fit kernel to the data … O(log_B N) steps
   Pattern 2: Fit kernel to the machine … O(1) steps

 Now need to add partial sums back to partial results
   kernels need to save partial results
   need additional kernel to add partial sums back in

 Sketch of algorithm using pattern 2 (launch sketch below):
   1) B blocks compute partial results for subsequences
   2) 1 block performs scan of partial sums from (1)
   3) N/B blocks add results from (2) into results from (1)
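A host-side view of those three launches (kernel names are hypothetical, not from the slides; B is the block size, d_data holds the N inputs, d_block_sums one partial sum per subsequence):

scan_subsequences<<<B, B, B * sizeof(int)>>>(d_data, d_block_sums, N);   // (1) per-subsequence scans, totals saved
scan_block_sums<<<1, B, B * sizeof(int)>>>(d_block_sums);                // (2) one block scans the partial sums
add_block_offsets<<<N/B, B>>>(d_data, d_block_sums, N);                  // (3) add each subsequence's offset back in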

Page 38: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Directions for improvement

 Templating for generic operators
   scan<plus>() much more powerful than plus_scan()

 Pay attention to operator properties
   associative? commutative? has inverse? has identity?
   the fewer that are true, the more careful the code must be
   e.g., reduction examples assume commutative with identity

 Many opportunities for code optimization
   remove inefficiencies
   (auto-) tuning for different cases

Page 39: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Directions for optimization

 Loop unrolling with fixed block sizes
   almost always a good idea

 Minimize unutilized threads
   reduction example leaves half its threads idle

 Improve work efficiency
   scan example does O(n log n) work

 Watch out for divergence and incoherent load/store
   sometimes at odds with earlier directions

Page 40: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

GPU ARCHITECTURE

Page 41: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

GPU ARCHITECTURE

[Diagram: GeForce GTX 280 / Tesla T10 overview — 240 scalar cores, on-chip
memory, texture units, fixed-function acceleration, communication fabric,
memory & I/O]

Page 42: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

[Diagram: one streaming multiprocessor (SM) — MT issue unit, instruction
cache, constant cache, 8 SPs, 2 SFUs, DP unit, shared memory]

Key Architectural Ideas

 Massive multithreading
   hide latency with computation, not cache

 SIMT execution (Single Instruction Multiple Thread)
   Warps – threads run in groups of 32
   HW handles divergence gracefully

Page 43: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Reduction tree redux

Input (shared memory):       10  1  8 -1  0 -2  3  5 -2 -3  2  7  0 11  0  2

active threads 0-7, x[i] += x[i+8]:
                              8 -2 10  6  0  9  3  7 -2 -3  2  7  0 11  0  2

active threads 0-3, x[i] += x[i+4]:
                              8  7 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2

active threads 0-1, x[i] += x[i+2]:
                             21 20 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2

active thread 0, x[i] += x[i+1]:
                             41 20 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2

Final result: 41 (in x[0])

Page 44: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Compare to interleaved addressing:

Input (shared memory):       10  1  8 -1  0 -2  3  5 -2 -3  2  7  0 11  0  2

8 active threads, x[i] += x[i+1] for even i:
                             11  1  7 -1 -2 -2  8  5 -5 -3  9  7 11 11  2  2

4 active threads, x[i] += x[i+2] for i divisible by 4:
                             18  1  7 -1  6 -2  8  5  4 -3  9  7 13 11  2  2

2 active threads, x[i] += x[i+4] for i divisible by 8:
                             24  1  7 -1  6 -2  8  5 17 -3  9  7 13 11  2  2

1 active thread, x[0] += x[8]:
                             41  1  7 -1  6 -2  8  5 17 -3  9  7 13 11  2  2

Final result: 41 (in x[0])

Page 45: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

SOME FINAL THOUGHTS

Faster is not “just faster” -- David Kirk, Chief Scientist

  2-3X faster is “just faster”
   Do a little more, wait a little less
   Doesn’t change how you work

  5-10x faster is “significant”
   Worth upgrading
   Worth re-writing (parts of) the application

  100x (or more!) faster is “fundamentally different”
   Worth considering a new platform
   Worth re-architecting the application
   Makes new applications possible
   Drives “time to discovery”, changes science

Page 46: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

SOME FINAL THOUGHTS

  We should teach parallel computing in CS 1 or CS 2

  Remember: computers don’t get faster, just wider

  Heapsort and mergesort

  Both O(n lg n)

  One parallel-friendly, one not

  Students need to understand this early

Page 47: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Conclusion

 GPUs are massively parallel manycore computers
   Ubiquitous - most successful parallel processor in history
   Useful - users achieve huge speedups on real problems

 CUDA is a powerful parallel programming model
   Heterogeneous - mixed serial-parallel programming
   Scalable - hierarchical thread execution model
   Accessible - minimal but expressive changes to C

  They provide tremendous scope for innovative research

Page 48: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Questions?

Page 49: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Copyright NVIDIA 2008 49

Questions & Discussion

  Lots more slides…
   CUDA programming
   Advanced examples: sparse matrix-vector, ray tracing
   Performance: tools and applications
     BLAS, FFT, sort, sparse matrix-vector
     fluid dynamics, molecular dynamics, n-body, discontinuous Galerkin
   GPU architecture
   Myths of GPU Computing

  Otherwise: [email protected]

Page 50: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Example: Sparse Matrix-Vector

Page 51: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Sparse matrix-vector multiplication

 Sparse matrices have relatively few non-zero entries
   frequently O(n) rather than O(n²)
   only store & operate on these non-zero entries

Example: Compressed Sparse Row (CSR) format for the 4x4 matrix

    [ 3 0 1 0 ]
    [ 0 0 0 0 ]
    [ 0 2 4 1 ]
    [ 1 0 0 1 ]

Av[7] = { 3, 1, 2, 4, 1, 1, 1 };   // non-zero values
Aj[7] = { 0, 2, 1, 2, 3, 0, 3 };   // column indices
Ap[5] = { 0, 2, 2, 5, 7 };         // row pointers (spans of row 0, row 2, row 3)

Page 52: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Sparse matrix-vector multiplication

float multiply_row(uint rowsize,  // number of non-zeros in row
                   uint *Aj,      // column indices for row
                   float *Av,     // non-zero entries for row
                   float *x)      // the RHS vector
{
    float sum = 0;

    for(uint column = 0; column < rowsize; ++column)
        sum += Av[column] * x[Aj[column]];

    return sum;
}

Av[7] = { 3, 1, 2, 4, 1, 1, 1 };   // non-zero values
Aj[7] = { 0, 2, 1, 2, 3, 0, 3 };   // column indices
Ap[5] = { 0, 2, 2, 5, 7 };         // row pointers

Page 53: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Sparse matrix-vector multiplication

float multiply_row(uint size, uint *Aj, float *Av, float *x);

void csrmul_serial(uint *Ap, uint *Aj, float *Av,
                   uint num_rows, float *x, float *y)
{
    for(uint row = 0; row < num_rows; ++row)
    {
        uint row_begin = Ap[row];
        uint row_end   = Ap[row+1];

        y[row] = multiply_row(row_end - row_begin,
                              Aj + row_begin, Av + row_begin, x);
    }
}

Page 54: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Sparse matrix-vector multiplication

float multiply_row(uint size, uint *Aj, float *Av, float *x);

__global__ void csrmul_kernel(uint *Ap, uint *Aj, float *Av,
                              uint num_rows, float *x, float *y)
{
    uint row = blockIdx.x*blockDim.x + threadIdx.x;

    if( row < num_rows )
    {
        uint row_begin = Ap[row];
        uint row_end   = Ap[row+1];

        y[row] = multiply_row(row_end - row_begin,
                              Aj + row_begin, Av + row_begin, x);
    }
}

Page 55: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Compare with OpenMP

OpenMP Kernel:

void csrmul_openmp(… … … … … …)
{
    #pragma omp parallel for
    for(uint row = 0; row < num_rows; ++row)
    {
        uint row_begin = Ap[row];
        uint row_end   = Ap[row+1];

        y[row] = multiply_row(… … … …);
    }
}

CUDA Kernel:

__global__ void csrmul_kernel(… … … … … …)
{
    uint row = blockIdx.x*blockDim.x + threadIdx.x;

    if( row < num_rows )
    {
        uint row_begin = Ap[row];
        uint row_end   = Ap[row+1];

        y[row] = multiply_row(… … … …);
    }
}

Page 56: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Adding a simple caching scheme

__global__ void csrmul_cached(… … … … … …)
{
    uint begin = blockIdx.x*blockDim.x, end = begin + blockDim.x;
    uint row   = begin + threadIdx.x;

    __shared__ float cache[blocksize];        // array to cache rows
    if( row < num_rows )
        cache[threadIdx.x] = x[row];          // fetch to cache
    __syncthreads();

    if( row < num_rows )
    {
        uint row_begin = Ap[row], row_end = Ap[row+1];
        float sum = 0;

        // body of multiply_row(), inlined so it can read the cache
        for(uint col = row_begin; col < row_end; ++col)
        {
            uint j = Aj[col];

            // Fetch from cached rows when possible
            float x_j = (j >= begin && j < end) ? cache[j - begin] : x[j];

            sum += Av[col] * x_j;
        }

        y[row] = sum;
    }
}

Page 57: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Example: Scan

Page 58: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Example: Multigrid

Poisson Solver

Page 59: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Results: Multigrid Poisson Solver

do k = 1, nz
  do j = 1, ny
    do i = 1+mod(k+j+redblack,2), nx, 2
      residual = -hsq*B(i,j,k) - U(i,j,k)*6 + &
                 (U(i+1,j,k) + U(i-1,j,k)) + &
                 (U(i,j+1,k) + U(i,j-1,k)) + &
                 (U(i,j,k-1) + U(i,j,k+1))
      U(i,j,k) = U(i,j,k) + (omega/6)*residual
    end do
  end do
end do

Gauss-Seidel Red-Black smoother main loop - FORTRAN

Page 60: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Results: Multigrid Poisson Solver

__global__ void Relax( Grid U, Grid B, float omega, int red_black )
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    int k = blockIdx.z*blockDim.z + threadIdx.z;

    if ((j+k+i) % 2 != red_black) return;

    int idx = i + j*U.jstride + k*U.kstride;

    float residual = -U.hsq*B.buf[idx] - 6.0f*U.buf[idx]
                   + U.buf[idx + U.istride] + U.buf[idx - U.istride]
                   + U.buf[idx + U.jstride] + U.buf[idx - U.jstride]
                   + U.buf[idx + 1        ] + U.buf[idx - 1        ];

    U.buf[idx] += omega * residual / 6.0f;
}

Gauss-Seidel Red-Black smoother kernel - CUDA

Page 61: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Results: Multigrid Poisson Solver

 Optimal version is slightly more complex
   fast math intrinsics, optimized memory layout, etc.

 Single precision, 1 step of full multigrid
 GTX 280 vs. 3.16 GHz Core2 Duo E8500 (using 1 core)

    Grid size          GTX 280     Core2 E8500
    32 x 32 x 32       0.01 sec    0.02 sec
    64 x 64 x 64       0.02 sec    0.14 sec
    128 x 128 x 128    0.06 sec    1.16 sec
    256 x 256 x 256    0.33 sec    9.27 sec

Page 62: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Example: Vector Add w/ Host Code

Page 63: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Example: Vector Addition Kernel

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Run N/256 blocks of 256 threads each
    vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C);
}

Device Code

Page 64: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Example: Vector Addition Kernel

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Run N/256 blocks of 256 threads each
    vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C);
}

Host Code

Page 65: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Example: Host code for vecAdd

// allocate and initialize host (CPU) memory
float *h_A = …, *h_B = …;

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

// execute the kernel on N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
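The slide stops at the launch; a sketch of the remaining steps would copy the result back and release device memory (h_C is an assumed host array for the result):

// copy result from device to host
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );

// free device memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);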

Page 66: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Example: Ray Tracing

Page 67: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Copyright NVIDIA 2008 67

Implications & Strategy

  GTX 280 supports up to 30,720 concurrent threads!

  Implication: minimize state per thread

  Strategy: replace traversal stack with short stack

Horn et al., Interactive k-D Tree GPU Raytracing, I3D 2008

The following slides are borrowed from Daniel Horn’s presentation

Page 68: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Copyright NVIDIA 2008 68

KD-Tree

[Figure: a k-d tree over a scene — splitting planes X, Y, Z partition the
space into leaf cells A, B, C, D; the ray enters the scene at tmin and exits
at tmax]

Page 69: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Copyright NVIDIA 2008 69

KD-Tree Traversal

[Figure: traversing the k-d tree for a ray — at each inner node the ray
visits the near child first and pushes the far child onto a stack for later
(here the stack holds Z, A)]

Page 70: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Copyright NVIDIA 2008 70

KD-Restart

[Figure: the same k-d tree scene with cells A-D and planes X, Y, Z]

  Standard traversal
   omit stack operations
   proceed to 1st leaf

  If no intersection
   advance (tmin, tmax)
   restart from root
   proceed to next leaf

Page 71: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Copyright NVIDIA 2008 71

KD-Restart with short stack (size 1)

[Figure: the same traversal, but a 1-entry stack caches the most recently
skipped far child (e.g., Z, then A), so a restart can resume from it instead
of always returning to the root]

Page 72: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Copyright NVIDIA 2008 72

Short Stack Cache

  Even better:
   each thread stores its full stack in memory (non-blocking writes)
   cache the top of the stack locally (registers or shared memory)

  Enables BVHs as well as k-d trees

  5-10% faster in our current implementation

Page 73: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Products

Page 74: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Tesla S1070 1U System

4 Teraflops (single precision)
800 watts (typical power)

Page 75: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Tesla C1060 Board

957 Gigaflops (single precision)
160 watts (typical power)

Page 76: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Building a 100TF datacenter

                       CPU 1U Server              Tesla 1U System
  Per server:          4 CPU cores                4 GPUs: 960 cores
                       0.07 Teraflop              4 Teraflops
                       $ 2000                     $ 8000
                       400 W                      800 W

  For ~100 TF:         1429 CPU servers           25 CPU servers + 25 Tesla systems
                       $ 3.1 M                    $ 0.31 M
                       571 kW                     27 kW

                                                  10x lower cost, 21x lower power

Page 77: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Tesla Personal Supercomputer

Supercomputing Performance
  Massively parallel CUDA architecture
  960 cores, 4 TeraFlops
  250x the performance of a desktop

Personal
  One researcher, one supercomputer
  Plugs into standard power strip

Accessible
  Program in C for Windows, Linux
  Available now worldwide under $10,000

Page 78: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

CUDA SDK

[Diagram: CUDA SDK stack — integrated CPU and GPU C source code is compiled
by the NVIDIA C compiler into NVIDIA assembly for computing (GPU) and CPU
host code; libraries (FFT, BLAS, …), example source code, the CUDA driver,
debugger, and profiler sit alongside the standard C compiler targeting the CPU]

Page 79: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Applications

Page 80: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Example: Single Precision BLAS

[Chart: BLAS (SGEMM) on CUDA — GFLOPS (0-350) vs. matrix size for CUDA
(CUBLAS) vs. ATLAS 1 Thread and ATLAS 4 Threads]

CUBLAS: CUDA 2.0b2, Tesla C1060 (10-series GPU); ATLAS 3.81 on dual 2.8 GHz Opteron dual-core

Page 81: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Example: Double Precision BLAS

[Chart: BLAS (DGEMM) on CUDA — GFLOPS (0-70) vs. matrix size for CUBLAS vs.
ATLAS Parallel and ATLAS Single]

CUBLAS: CUDA 2.0b2 on Tesla C1060 (10-series); ATLAS 3.81 on Intel Xeon E5440 quad-core, 2.83 GHz

Page 82: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

DGEMM Performance

GPU performance includes data copies over PCIe gen-2

Page 83: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Direct Coulomb Summation (VMD)

[Chart: kernel performance vs. CPU]
  CUDA-Simple: 14.8x faster than CPU, 33% of the fastest GPU kernel
  CUDA-Unroll8clx: fastest GPU kernel, 44x faster than CPU, 291 GFLOPS on GeForce 8800 GTX

GPU computing. J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, J. Phillips. Proceedings of the IEEE, 2008. In press.

http://www.ks.uiuc.edu/Research/namd/

Page 84: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Fluid Dynamics

  2-D compressible Euler equations
   irregular structured grid
   finite volume method (Ni, 1981)

E. H. Phillips, Y. Zhang, R. L. Davis, J. D. Owens. A multi-grid solver for the 2D compressible Euler equations on a GPU cluster. http://www.ece.ucdavis.edu/cerl/techreports/2008-2/

Page 85: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Solving Euler Equations

 Divide domain into tiles & distribute amongst blocks

 As always, try for:
   maximal work per tile
   minimal data exchange between tiles
   load balance within/between tiles

Page 86: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Solving Euler Equations

Parallel speedup vs. single CPU core, by domain size (nodes):

                        1.6K   25K   400K   1.6M   6.4M
  QuadroFX 5600            4    18     22     22     22
  2x QuadroFX 5600         4    22     39     42     43
  4x QuadroFX 5600         2    22     69     80     85
  8x QuadroFX 5600         1    14    110    136    160

Page 87: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Discontinuous Galerkin Methods

 Weak coupling between elements
   limited intra-element communication
   allows non-conforming meshes
   allows heterogeneous element types

 Support local higher-order operations
   increased arithmetic per element
   but still local to the element

Page 88: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

CPU code achieves ~3 GFLOPS/core

Full time-dependent electromagnetic scattering in an open volume surrounding an aircraft

from T. Warburton, A. Klockner, J. Hesthaven

Page 89: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Multi-GPU Parallel Scaling

We see just over 700 GFLOPS on 3x GTX 280

from T. Warburton, A. Klockner, J. Hesthaven

Page 90: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Sparse Linear Algebra Results

Page 91: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

Sparse Matrix-Vector Multiplication (SpMV) on CUDA

 Experimented with several data structures
   CSR: Compressed Sparse Row
   HYB: hybrid of ELLPACK (ELL) and Coordinate (COO) formats

 HYB gave best results
   speed of ELL with the flexibility of COO

 Benchmarked against matrices from “Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms”, S. Williams et al., Supercomputing 2007

Page 92: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Results: Sparse Matrix-Vector Multiplication (SpMV) on CUDA

CPU Results from “Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Williams et al, Supercomputing 2007

Page 93: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Double Precision

Page 94: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

T10 Double Precision Floating Point

  Precision:                            IEEE 754
  Rounding modes for FADD and FMUL:     all 4 IEEE (round to nearest, zero, +inf, -inf)
  Denormal handling:                    full speed
  NaN support:                          yes
  Overflow and infinity support:        yes
  Flags:                                no
  FMA:                                  yes
  Square root:                          software with low-latency FMA-based convergence
  Division:                             software with low-latency FMA-based convergence
  Reciprocal estimate accuracy:         24 bit
  Reciprocal sqrt estimate accuracy:    23 bit
  log2(x) and 2^x estimates accuracy:   23 bit

Page 95: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Double Precision Floating Point

                                     NVIDIA Tesla T10          SSE2                        Cell SPE
  Precision                          IEEE 754                  IEEE 754                    IEEE 754
  Rounding modes (FADD, FMUL)        all 4 IEEE                all 4 IEEE                  all 4 IEEE
  Denormal handling                  full speed                supported, costs            supported only for results, not input
                                                               1000’s of cycles            operands (input denormals flushed to 0)
  NaN support                        yes                       yes                         yes
  Overflow and infinity support      yes                       yes                         yes
  Flags                              no                        yes                         yes
  FMA                                yes                       no                          yes
  Square root                        software (FMA-based)      hardware                    software only
  Division                           software (FMA-based)      hardware                    software only
  Reciprocal estimate accuracy       24 bit                    12 bit                      12 bit + step
  Reciprocal sqrt est. accuracy      23 bit                    12 bit                      12 bit + step
  log2(x) and 2^x est. accuracy      23 bit                    no                          no

Page 96: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2007

Single Precision Floating Point

                                     G80                       SSE                  IBM AltiVec            Cell SPE
  Precision                          IEEE 754                  IEEE 754             IEEE 754               IEEE 754
  Rounding modes (FADD, FMUL)        round to nearest and      all 4 IEEE           round to nearest       round to zero /
                                     round to zero                                  only                   truncate only
  Denormal handling                  flush to zero             supported,           supported,             flush to zero
                                                               1000’s of cycles     1000’s of cycles
  NaN support                        yes                       yes                  yes                    no
  Overflow and infinity support      yes, only clamps          yes                  yes                    no, infinity
                                     to max norm
  Flags                              no                        yes                  yes                    some
  Square root                        software only             hardware             software only          software only
  Division                           software only             hardware             software only          software only
  Reciprocal estimate accuracy       24 bit                    12 bit               12 bit                 12 bit
  Reciprocal sqrt est. accuracy      23 bit                    12 bit               12 bit                 12 bit
  log2(x) and 2^x est. accuracy      23 bit                    no                   12 bit                 no

Page 97: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

Quotes

Page 98: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

GPUs have evolved to the point where many real world applications are easily implemented on them and run significantly faster than on multi-core systems. Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs.

Jack Dongarra Professor, University of Tennessee Author of Linpack

Page 99: IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

© NVIDIA Corporation 2008

We’ve all heard ‘desktop supercomputer’ claims in the past, but this time it’s for real: NVIDIA and its partners will be delivering outstanding performance and broad applicability to the mainstream marketplace.

Heterogeneous computing is what makes such a breakthrough possible.

Burton Smith Technical Fellow, Microsoft Formerly, Chief Scientist at Cray