CUDA on BioHPC – Applications & Parallel Programming with GPUs (updated 2015-08-19)


Page 1: CUDA on BioHPC

CUDA on BioHPC – Applications & Parallel Programming with GPUs

Updated for 2015-08-19

Page 2: CUDA on BioHPC

Why are CUDA and GPUs Useful?


Tesla K20 Speed-Up over Sandy Bridge CPUs

CPU results: dual-socket E5-2687W, 3.10 GHz; GPU results: dual-socket E5-2687W + 2 Tesla K20X GPUs

*MATLAB results compare one i7-2600K CPU vs. one Tesla K20 GPU

Disclaimer: Non-NVIDIA implementations may not have been fully optimized

[Bar chart, 0x–20x axis: Tesla K20 speed-ups for AMBER (Molecular Dynamics), SPECFEM3D (Earth Science), Chroma (Physics), and MATLAB FFT (Engineering)]

© NVIDIA 2013

Page 3: CUDA on BioHPC

Molecular Dynamics (GROMACS / AMBER…)


Folding@Home

Most famous distributed HPC effort

Uses GROMACS core to run simulations

122,784 active CPUs: 472 TFLOPS

47,226 active GPUs: 14,183 TFLOPS

On average, each GPU is as good as 78 CPUs

(Total GPU capacity equivalent to ~6500 of our GPU nodes!!)

Page 4: CUDA on BioHPC

DNA Sequence Alignment

MUMmerGPU 2.0

Align 1,000,000 simulated 150 bp reads to the E. coli K-12 genome sequence.

CPU: 90.6 s    K20 GPU: 20.09 s

4.5x speedup

Page 5: CUDA on BioHPC

Phylogenetics

Ribosomal proteins: 48 taxa, 11,949 characters, 8 chains

CPU: 389 s    K20 GPU: 66 s

5.9x speedup

MrBayes 3.2.5 (CPU, single-node parallel) vs. GPU MrBayes 3.1.2 (K20 GPU)

Analysis time for 8,000 generations

Page 6: CUDA on BioHPC

GROMACS on BioHPC

Lysozyme-in-water tutorial

CPU: 36.6 ns/day    K20 GPU: 57.48 ns/day

1.6x speedup

(A very simple system; the speed-up grows for more complex systems, up to ~3-4x.)

Page 7: CUDA on BioHPC

What are GPUs? What is CUDA?

Massively parallel code, running on a Graphics Processing Unit

Page 8: CUDA on BioHPC

What are GPUs? What is CUDA?

[Diagram: the application code is split between CPU and GPU – compute-intensive functions are parallelized on the GPU, while the rest of the sequential code runs on the CPU]

© NVIDIA 2013

Page 9: CUDA on BioHPC

CUDA on BioHPC - Hardware

Nucleus042-43: Tesla K40 GPUs

Nucleus044-49: Tesla K20 GPUs

Page 10: CUDA on BioHPC

CUDA on BioHPC - Hardware

BioHPC workstations (Dell Precision towers)

NVIDIA Quadro 600 / K600 / K620

1 / 1 / 2 GB of GPU RAM (shared with the display)

96 / 192 / 384 CUDA cores

Good for developing code. Not always faster than CPU code on these machines (especially against CPU code built with the Intel compiler and MKL).

Page 11: CUDA on BioHPC

How to use CUDA – three routes to accelerating applications:

• Libraries: "drop-in" acceleration

• OpenACC directives: easily accelerate applications

• Programming languages (CUDA C/C++/Fortran): maximum flexibility

Page 12: CUDA on BioHPC

GPU Software – BioHPC Portal


List of GPU-supporting software will be added to the BioHPC portal soon!

Page 13: CUDA on BioHPC

CUDA on BioHPC - Software

module load cuda65
NVIDIA CUDA toolkit – for writing and building CUDA C/C++/Fortran
Libraries: cuBLAS, Thrust, etc.

module load cuda65/nsight
Nsight – CUDA debugging / profiling

Also various software is available with GPU support:

pycuda in python/xxx-anaconda

gputools in R/3.2.1-intel

Parallel Computing Toolbox in MATLAB

GPU support in applications – e.g. GROMACS, various SBGrid tools, etc.
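For example, building and running one of the worked examples from the following slides might look like this (a minimal sketch: the file name matches the later slides, and the CPU OpenMP code additionally needs -Xcompiler -fopenmp passed through to the host compiler):

module load cuda65
nvcc -O2 -o 1_naive_mmult 1_naive_mmult.cu
./1_naive_mmult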

Page 14: CUDA on BioHPC

GPU-Accelerated Libraries

NVIDIA cuBLAS, cuRAND, cuSPARSE, cuFFT and NPP; Thrust (C++ STL features for CUDA); IMSL Library; ArrayFire matrix computations; plus libraries for vector signal/image processing, GPU-accelerated dense and sparse linear algebra, matrix algebra on GPU and multicore, and building-block algorithms for CUDA.

Page 15: CUDA on BioHPC

Programming Languages


Page 16: CUDA on BioHPC

Worked Example – Matrix Multiplication

Let's multiply matrices with:

• CPU & GPU naively
• GPU with some optimization
• CPU & GPU with MKL / cuBLAS

Which will be fastest?!

Why is matrix multiplication well suited to GPUs?

Page 17: CUDA on BioHPC

CPU vs GPU – What’s Different?

CPU: few fast cores, very clever.

GPU: many slower cores, not as smart, but faster overall if you are clever!

Page 18: CUDA on BioHPC

Running on the GPU – Load data into GPU RAM

1. Copy input data from CPU memory to GPU memory (over the PCI bus)

© NVIDIA 2013

Page 19: CUDA on BioHPC

Running on the GPU – Parallel Code Execution

1. Copy input data from CPU memory to GPU memory (over the PCI bus)

2. Load the GPU program and execute, manually caching data in shared memory for performance

© NVIDIA 2013

Page 20: CUDA on BioHPC

Running on the GPU – Retrieving the Results

1. Copy input data from CPU memory to GPU memory (over the PCI bus)

2. Load the GPU program and execute, manually caching data in shared memory for performance

3. Copy the results from GPU memory back to CPU memory

© NVIDIA 2013
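In CUDA C these three steps map onto a handful of API calls. A minimal sketch (the kernel and sizes here are illustrative, not part of the matrix example that follows):

#include <cstdlib>

// Illustrative kernel: double every element of an array
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) x[i] *= 2.0f;                         // guard against out-of-range threads
}

int main() {
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_x = (float *)calloc(n, sizeof(float));  // host buffer
    float *d_x; cudaMalloc(&d_x, bytes);             // device buffer

    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  // 1. copy input to the GPU
    scale<<<(n + 255) / 256, 256>>>(d_x, n);               // 2. launch the kernel
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);  // 3. copy the results back

    cudaFree(d_x); free(h_x);
    return 0;
}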

Page 21: CUDA on BioHPC

Matrix Multiplication


Each output value can be independently computed.

We can chop the problem up into manageable pieces to run in parallel.

How do we do this on the GPU?

Page 22: CUDA on BioHPC

Grids, Blocks, Threads, Warps, Stream Processors – Oh my!


A problem needs the same calculation(s) performed N times on different data.

The calculation needed for a single output is implemented as a kernel.

The entire problem is represented in a 1/2/3D grid structure.

The grid is split up into smaller blocks, to fit the architecture of the GPU

Page 23: CUDA on BioHPC

Grids, Blocks, Threads, Warps, Stream Processors – Oh my!


Inside each block the calculation on each piece of data will be performed by a separate thread executing the kernel.

Threads in a block are run in groups called warps.

All threads in a warp run the same instruction at the same time, in parallel.

Blocks and their warps are scheduled across multiple stream processors.
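On NVIDIA GPUs a warp is 32 threads, so the 16x16 = 256-thread blocks used in the examples below split into 8 full warps. Inside a kernel, each thread works out which output element it owns from its block and thread indices; a minimal sketch of the pattern the following kernels use:

int i = blockIdx.y * blockDim.y + threadIdx.y;   // global row index of this thread
int j = blockIdx.x * blockDim.x + threadIdx.x;   // global column index of this thread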

Page 24: CUDA on BioHPC

CPU Matrix Multiplication (Naïve with OpenMP)


void cpu_naive_mmul(const float *A, const float *B, float *C,
                    const int M, const int K, const int N) {
    // A is M x K, B is K x N, C is M x N, all stored row-major
    #pragma omp parallel for
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            float sum = 0.0;
            for (int k = 0; k < K; k++) {
                sum += A[(i * K) + k] * B[(k * N) + j];
            }
            C[(i * N) + j] = sum;
        }
    }
}

1_naive_mmult.cu

Page 25: CUDA on BioHPC

A GPU Matrix Multiplication Kernel (Naïve)


__global__ void gpu_naive_mmul(const float *A, const float *B, float *C,
                               const int M, const int K, const int N) {
    // Each thread computes one element C[i][j] of the M x N output (row-major)
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;

    if ((i >= M) || (j >= N)) return;

    float sum = 0.0;
    for (int k = 0; k < K; ++k) {
        sum += A[(i * K) + k] * B[(k * N) + j];
    }

    C[(i * N) + j] = sum;
}

1_naive_mmult.cu

Page 26: CUDA on BioHPC

Initializing some matrices & copying to the GPU


// Allocate matrices on the host
float *h_A = (float *)malloc(nr_rows_A * nr_cols_A * sizeof(float));
float *h_B = (float *)malloc(nr_rows_B * nr_cols_B * sizeof(float));
float *h_C = (float *)malloc(nr_rows_C * nr_cols_C * sizeof(float));

// Fill the input matrices A and B with numbers
init_const(h_A, nr_cols_A * nr_rows_A);
init_const(h_B, nr_cols_B * nr_rows_B);

// Allocate matrices on the GPU
float *d_A, *d_B, *d_C;
cudaMalloc(&d_A, nr_rows_A * nr_cols_A * sizeof(float));
cudaMalloc(&d_B, nr_rows_B * nr_cols_B * sizeof(float));
cudaMalloc(&d_C, nr_rows_C * nr_cols_C * sizeof(float));

// Copy the matrices to the GPU
cudaMemcpy(d_A, h_A, nr_rows_A * nr_cols_A * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, nr_rows_B * nr_cols_B * sizeof(float), cudaMemcpyHostToDevice);

1_naive_mmult.cu

Page 27: CUDA on BioHPC

Running the Kernel


// Set up the size of our blocks and grid
dim3 dimBlock(16, 16);
dim3 dimGrid(nr_cols_B / dimBlock.x, nr_rows_A / dimBlock.y);

// Perform the actual multiplication on the GPU with the naive kernel
gpu_naive_mmul<<<dimGrid, dimBlock>>>(d_A, d_B, d_C,
                                      nr_rows_A, nr_cols_A, nr_cols_B);

// Wait for the GPU to finish
cudaDeviceSynchronize();

1_naive_mmult.cu
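The integer divisions above only cover the whole output when the matrix dimensions are exact multiples of the block size, as with 4096 and 16 here. A sketch of the more general launch, which rounds the grid up and relies on the (i >= M) || (j >= N) guard in the kernel to mask the excess threads:

dim3 dimBlock(16, 16);
// Round up so that partially filled blocks cover the edges of the output matrix
dim3 dimGrid((nr_cols_B + dimBlock.x - 1) / dimBlock.x,
             (nr_rows_A + dimBlock.y - 1) / dimBlock.y);
gpu_naive_mmul<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, nr_rows_A, nr_cols_A, nr_cols_B);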

Page 28: CUDA on BioHPC

Getting the result, and cleaning up


// Copy the result from the device to the host
cudaMemcpy(h_C, d_C, nr_rows_C * nr_cols_C * sizeof(float), cudaMemcpyDeviceToHost);

// Free GPU memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);

1_naive_mmult.cu

Page 29: CUDA on BioHPC

How Fast?


Multiplying 4096x4096 square matrices of floating point values.

16,777,216 output values, each requiring 4,096 multiplications and 4,096 additions.

~137 GFLOP in total (137 billion floating-point operations).

CPU routine: 6.5 s ≈ 21 GFLOP/s

GPU routine: 1.7 s ≈ 82 GFLOP/s, plus 0.056 s for data transfer

Each floating-point value is 4 bytes, so memory usage is 3 × 4096 × 4096 × 4 bytes ≈ 192 MB.

Page 30: CUDA on BioHPC

Speeding up the GPU Kernel

Shared memory (per block): fast access, KB of space (48 KB on our GPUs)

Global memory: slower access, GB of space (6/12 GB on our GPUs)

Page 31: CUDA on BioHPC

Speeding up the GPU Kernel

Each block generates part of the output matrix, so it doesn't need all of the input matrices.

Put just the pieces it needs into fast shared memory!

Page 32: CUDA on BioHPC

Tiled Matrix Multiplication Kernel


__global__ void gpu_tiled_mmul(const float *A, const float *B, float *C,
                               const int wA, const int wB) {
    // Block index & thread index
    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;
    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + wA - 1;
    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;
    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;
    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;

    // Running total for the single output element this thread computes
    float Csub = 0;

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {

        // Shared memory for tiles from A and B
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the matrices from device memory to shared memory;
        // each thread loads one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

2_tiled_mmult.cu

Page 33: CUDA on BioHPC

Tiled Matrix Multiplication Kernel


        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element of the block sub-matrix
#pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k) {
            Csub += As[ty][k] * Bs[k][tx];
        }

        // Synchronize to make sure that the preceding computation is done
        // before loading two new sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;

}

2_tiled_mmult.cu
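The tiled kernel assumes BLOCK_SIZE is a compile-time constant matching the block dimensions used at launch; for the 16x16 blocks used in this example that would be, e.g.:

#define BLOCK_SIZE 16   // tile width, matching dim3 dimBlock(16,16)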

Page 34: CUDA on BioHPC

How Much Faster?

Naïve GPU: 1.7 s ≈ 82 GFLOP/s

Tiled GPU: 0.673 s ≈ 204 GFLOP/s

             Copy to GPU   Execute   Copy back   (milliseconds)
GPU Naïve         36         1653        20
GPU Tiled         37          673        20

2_tiled_mmult.cu

Page 35: CUDA on BioHPC

What About Libraries?


BLAS is a widely used library for linear algebra

Intel MKL is an optimized implementation for Intel CPUs

MKL BLAS 0.464s ~ 295 GFLOP/s

// Intel MKL headers (for MKL BLAS)
#include <mkl.h>

void cpu_blas_mmul(const float *A, const float *B, float *C,
                   const int m, const int k, const int n) {
    // C (m x n) = A (m x k) * B (k x n), row-major storage
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, A, k, B, n, 0.0, C, n);
}

3_blas_mmult.cu

Page 36: CUDA on BioHPC

cuBLAS


cuBLAS 0.062s ~ 2.2 TFLOP/s

// cuBLAS header
#include <cublas_v2.h>

void gpu_blas_mmul(const float *A, const float *B, float *C,
                   const int m, const int k, const int n) {
    const float alf = 1;
    const float bet = 0;
    const float *alpha = &alf;
    const float *beta  = &bet;

    // Create a handle for cuBLAS
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Do the actual multiplication
    // (cuBLAS assumes column-major storage, hence the leading dimensions m, k, m)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                alpha, A, m, B, k, beta, C, m);

    // Destroy the handle
    cublasDestroy(handle);
}

3_blas_mmult.cu
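Note that cuBLAS is a separate library, so the program must be linked against it (with nvcc, add -lcublas to the build command).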

Page 37: CUDA on BioHPC

Comparing All the Methods…

4096x4096 square matrix multiplication, total time in milliseconds:

CPU Naïve (OpenMP):  6491
CPU MKL BLAS:         464
GPU Naïve:           1653
GPU Tiled:            673
GPU cuBLAS:            62

~7.5x speedup of cuBLAS over the best CPU result (MKL BLAS)

Page 38: CUDA on BioHPC

Thrust – GPU Implementations of Standard Algorithms


#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <cstdlib>   // rand
#include <iostream>

...

// Generate random integers on the host
thrust::host_vector<int> h_vec(200000);
thrust::generate(h_vec.begin(), h_vec.end(), rand);

// Transfer to the device and sort
thrust::device_vector<int> d_vec = h_vec;
thrust::sort(d_vec.begin(), d_vec.end());

// Copy back and show a result
thrust::host_vector<int> h_result = d_vec;
std::cerr << "third item in sorted data: " << h_result[2] << std::endl;
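Thrust is a header-only template library bundled with the CUDA toolkit, so the example above needs no extra modules or linker flags – nvcc builds it as-is.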

Page 39: CUDA on BioHPC

When should I use CUDA?


1) Amount of computation must be >> amount of memory transfer

2) Can represent problem as a 1/2/3D grid of independent blocks and threads

3) There are *many* independent calculations. The K20 has 2,496 cores and we need to fill them up: you must have threads >> cores for good efficiency (see the arithmetic after this list).

4) The problem fits inside the smaller RAM on the GPU (6/12 GB vs. 128/256/384 GB), or can be subdivided, but see 1.

5) A library is available for my task, or I am comfortable thinking about computer architectures.
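As a concrete check on point 3: the 4096x4096 multiplication above launches one thread per output element, i.e. 4096 × 4096 ≈ 16.8 million threads, or roughly 6,700 threads per K20 core – plenty to keep the GPU busy.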

Page 40: CUDA on BioHPC

Help!


https://developer.nvidia.com/resources

Fantastic documentation, tutorials, example code, videos, forums…

Page 41: CUDA on BioHPC

Acknowledgements


Various figures and examples from:

MSDN article – NVIDIA GPU Architecture
https://code.msdn.microsoft.com/windowsdesktop/NVIDIA-GPU-Architecture-45c11e6d
Alan Tatourian, Apache License v2.0

NVIDIA CUDA Education Materials
https://developer.nvidia.com/cuda-education
Mark Ebersole, Mark Harris, Thomas Bradley – NVIDIA Corporation