
Developing Kernels: Part 2
Algorithm Considerations, Multi-Kernel Programs and Optimization

Steven Gratton, Institute of Astronomy


• We want to break the problem up into multiple steps, each of which can have a large number of blocks, and of threads within blocks, thrown at it

• How do we synchronize between steps?

• Remember, __syncthreads() only works within a block…

• Key point: global memory persists on the GPU after a kernel finishes, and a new kernel isn't started until pending memory writes complete


Global Synchronization

• So, enforce global synchronization by stopping one kernel and starting another one!

• A bit drastic, but it works!

• O(10^-5 s) kernel launch overhead

• Shared memory is erased

• No problem as long as you don't have to do it too often!


So, we’ll need something like…

__global__ void topleft(…);
__global__ void strip(…);
__global__ void therest(…);

for (int n = 0; n < SZ; n++) {
    topleft<<<…>>>(n, …);
    strip<<<…>>>(n, …);
    therest<<<…>>>(n, …);
}
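For concreteness, here is a minimal sketch of how such a driver loop might look for a matrix of NTILES x NTILES tiles of size 16x16 held in a global array d_m. The kernel bodies, argument lists and launch configurations are illustrative assumptions, not the actual code from cholesky.cu:

#define TILE 16
#define NTILES 64   // assumed problem size: an (NTILES*TILE) x (NTILES*TILE) matrix

__global__ void topleft(float *m, int n) { /* factor tile (n,n) */ }
__global__ void strip(float *m, int n)   { /* update the strip of tiles at step n */ }
__global__ void therest(float *m, int n) { /* update the trailing submatrix */ }

void cholesky(float *d_m)
{
    dim3 threads(TILE, TILE);
    for (int n = 0; n < NTILES; n++) {
        // one block factors the current top-left tile
        topleft<<<1, threads>>>(d_m, n);
        if (n < NTILES - 1) {
            // one block per remaining tile in the strip
            strip<<<NTILES - 1 - n, threads>>>(d_m, n);
            // one block per tile of the trailing submatrix
            dim3 rest(NTILES - 1 - n, NTILES - 1 - n);
            therest<<<rest, threads>>>(d_m, n);
        }
        // launches in the same stream run in order, and each kernel sees the
        // previous kernel's completed global-memory writes: this is the
        // global synchronization point
    }
}

All three kernels see the updated matrix in d_m precisely because global memory persists between launches.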


Only worry about/optimize what matters!

• For us, therest<<<…>>>() is what matters

• As long as we don't do anything silly, the other two basically just have to work, and no more


So, what can we do?

• The matrix suggests a tiled approach, each thread doing one element.

• Keeping our thread blocks square for simplicity, a natural choice is 16x16 (more on this later), so we'll need a

dim3 Db(16,16);

N.B. 16x16 = 8x32, so no incomplete warps!

• But a problem now is that we have to subtract a vector times its transpose

• Vectors are a bit unnatural when dealing with 16x16 tiles!

• Can we improve our algorithm?


Now treat our elements as tiles!


• i.e. once the top-left tile is factored and the strip below it updated, what remains is just another Cholesky factorization, smaller by one tile!
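In equations (a reconstruction of the blocked step, since the original diagrams are not reproduced here; the tile labels are chosen for this write-up), split the matrix into its first tile row/column and the rest:

\[
\begin{pmatrix} A_{11} & A_{21}^{T} \\ A_{21} & A_{22} \end{pmatrix}
=
\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
\begin{pmatrix} L_{11}^{T} & L_{21}^{T} \\ 0 & L_{22}^{T} \end{pmatrix}
\;\Longrightarrow\;
L_{11}L_{11}^{T} = A_{11},\quad
L_{21} = A_{21}L_{11}^{-T},\quad
L_{22}L_{22}^{T} = A_{22} - L_{21}L_{21}^{T}.
\]

The first equation is the "topleft" factorization of a single 16x16 tile, the second is the "strip" solve, and the third says that what is left after the update is the Cholesky factorization of the trailing submatrix, one tile smaller.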


What have we gained?

• (N/16)^3 tiled operations altogether.

• Each of these is now a 16x16 matrix multiply, so 16^3 operations.

• No change in the FLOPs.

• But what about memory?

• If each block can store the tiles it needs in shared memory, then it only has to do 16x16 loads

• So it saves a factor of 16 on the memory!!!
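As a rough count (not from the slides): the element-wise version reads two operands from global memory for every multiply-add, while each tiled operation loads two 16x16 tiles into shared memory and then reuses them for 16^3 multiply-adds, so

\[
\text{global loads per multiply-add: } 2 \;\longrightarrow\; \frac{2\cdot 16^{2}}{16^{3}} = \frac{2}{16},
\]

a factor-of-16 reduction in global memory traffic.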


But why 16?

• A "magic number" in CUDA, for many reasons. We've already seen that 16^2 = 32x8

• But here's another one:

• On existing hardware, memory accesses are grouped by half-warp, which is 16 threads

• And global memory accesses are optimal ("coalesced") if all threads in a half-warp access consecutive memory locations

Even more important for 1st-generation hardware (now you can "shuffle" your accesses somewhat)
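As a small illustration (a sketch with assumed names, not code from the course), letting threadIdx.x run along a row of a row-major matrix makes the 16 threads of each half-warp read 16 consecutive floats, i.e. a coalesced access:

__global__ void loadTile(const float *m, int N)   // m: N x N, row-major, N a multiple of 16
{
    __shared__ float tile[16][16];

    int row = blockIdx.y * 16 + threadIdx.y;
    int col = blockIdx.x * 16 + threadIdx.x;

    // threads with consecutive threadIdx.x (one half-warp) read
    // m[row*N+col], m[row*N+col+1], ...: consecutive addresses, so coalesced
    tile[threadIdx.y][threadIdx.x] = m[row * N + col];
    __syncthreads();

    // … use the tile …
}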


Sharing memory banks

• Remember, we're caching the tiles in shared memory!

• Like global memory, shared memory is accessed per half-warp.

• It is divided up into banks, currently 16, each a float wide.

• So __shared__ float c[16][16]; has one column per bank


• Shared memory is quick if:

• all threads access a different bank: c[?][tx] is fine

• or all threads access the same element in a given bank: c[ty][3] is fine (a broadcast)

• But c[tx][3] is not: every thread hits bank 3 at a different address
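A small sketch of the three cases (assuming a 16x16 block, so that within a half-warp ty is fixed and tx runs over 0…15):

__global__ void bankDemo()
{
    __shared__ float c[16][16];   // bank = column index mod 16
    int tx = threadIdx.x, ty = threadIdx.y;

    c[ty][tx] = 0.0f;             // each thread in the half-warp hits a different bank: fine
    __syncthreads();

    float ok  = c[ty][3];         // same bank, same address: a broadcast, also fine
    float bad = c[tx][3];         // same bank, 16 different addresses: 16-way conflict
    (void)ok; (void)bad;          // keep the compiler quiet about unused values
}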


• Look at B': we need to load b^T and c

• Fine for the untransposed matrix; we'll have something like

c[ty][tx] = m[ty][tx];

• but what about the transposed one? We need something like…

b[ty][tx] = m[ty][tx];

(the load itself is the same, untransposed and coalesced; the transpose is handled by reading b as b[i][ty] in the multiply below)


• To do the matrix multiply within a block, we'll need

for (int i = 0; i < 16; i++) {
    m[ty][tx] = m[ty][tx] - b[i][ty] * c[i][tx];
}

so we're okay here: b[i][ty] is a broadcast and c[i][tx] spans all 16 banks
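Putting these pieces together, here is a minimal sketch of what a trailing-update kernel along these lines could look like. The matrix layout (row-major, in a U^T U orientation so that the strip lies along a row of tiles), the name tileUpdate and the indexing are assumptions for illustration, not the actual therest kernel from the course code:

#define TILE 16

// Update one TILE x TILE tile of the trailing submatrix at step n:
//   A[I][J] -= U[n][I]^T * U[n][J], where U[n][*] is the strip factored at step n.
// a: N x N row-major matrix, N a multiple of TILE.
__global__ void tileUpdate(float *a, int N, int n)
{
    __shared__ float b[TILE][TILE];   // will hold the tile U[n][I]
    __shared__ float c[TILE][TILE];   // will hold the tile U[n][J]

    int tx = threadIdx.x, ty = threadIdx.y;
    int I = n + 1 + blockIdx.y;       // block-row of the tile this block updates
    int J = n + 1 + blockIdx.x;       // block-column of the tile this block updates

    // load both strip tiles with the same, coalesced pattern
    b[ty][tx] = a[(n * TILE + ty) * N + I * TILE + tx];
    c[ty][tx] = a[(n * TILE + ty) * N + J * TILE + tx];
    __syncthreads();

    // subtract (b^T c)[ty][tx] from our element and write it back;
    // b[i][ty] is a broadcast, c[i][tx] spans all 16 banks
    float v = a[(I * TILE + ty) * N + J * TILE + tx];
    for (int i = 0; i < TILE; i++)
        v -= b[i][ty] * c[i][tx];
    a[(I * TILE + ty) * N + J * TILE + tx] = v;
}

(For a symmetric matrix one would only launch, or keep, the blocks with J >= I.)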


Digression: Loop unrolling

#pragma unroll 16
for (int i = 0; i < 16; i++) {
    m[ty][tx] = m[ty][tx] - b[i][ty] * c[i][tx];
}

replaces the loop with 16 copies of its body, saving the loop overhead


What if you do have shared memory bank conflicts?

• These typically happen when accessing a 16x16 array columnwise rather than rowwise

• Consider defining your array "a bit off":

__shared__ float c[16][17];

• Now not only is each element of a row in a different bank, but each element of a column is too!

• (Just don't get a half-warp to access diagonal elements now…)

• Avoiding conflicts can be tricky if you're dealing with e.g. doubles
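A quick sketch of the effect (assumed names; the only point is the [16][17] padding):

__global__ void paddedColumnDemo()
{
    // 17 floats per row: element (row, 3) now lives in bank (row*17 + 3) mod 16,
    // which is different for every row of the column
    __shared__ float c[16][17];

    int tx = threadIdx.x, ty = threadIdx.y;
    c[ty][tx] = 1.0f;            // row-wise access: conflict-free, as before
    __syncthreads();

    float colRead = c[tx][3];    // column-wise access: now also conflict-free
    (void)colRead;
}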


So, let’s have a look through and give it a go!

…see cholesky.cu, transchol.cu, doublechol.cu

(Code available via www.ast.cam.ac.uk/~stg20/cuda/cholesky/)


What goes on “under the hood”?

• nvcc

• splits your .cu code into .c/.cpp code for the host compiler and code for the device

• compiles the device code into a virtual assembly language (.ptx)

• this then gets further compiled into proper executable machine code for your GPU (.cubin)

• everything then gets put back together and linked with the appropriate libraries to give you your executable "a.out"


You can take a look:

nvcc -keep …

Looking at the .ptx can give you a good idea of what is going on.

Nvidia haven't disclosed the real ISA, but you can look at the .cubin file and see some interesting things, like how many registers per thread and how much shared memory your kernels need. The less of both you use, the more blocks per multiprocessor you can run, which helps hide memory latency.
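For example (a sketch; the exact set of intermediate files left behind varies with the toolkit version):

nvcc -keep cholesky.cu
# keeps the intermediate files, including the .ptx (virtual assembly)
# and the compiled .cubin for the device code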


Conclusions

• We've seen that GPGPU is all about having many threads run the same program

• Hardware constraints make this more complicated than it might be

• We've seen how to use multiple kernels if we need global synchronization

• The choice of algorithm to implement is crucial

• We've looked at possible optimizations and at what goes on beneath the surface


Further issues

• Need 6 warps per multiprocessor to hide pipeline latency

• As well as shared memory bank conflicts, you can also have register bank conflicts. The only advice is to have 64, 128, … threads per block.

• Global memory is partitioned round-robin between channels in 256-byte chunks. Be careful to avoid all threads accessing only memory that lies in one channel! Nvidia now recognize this and call it "partition camping".

• nvcc -maxrregcount n can sometimes help