CUDA on BioHPC – Applications & Parallel Programming with GPUs (updated 2015-08-19)


Page 1: CUDA on BioHPC

CUDA on BioHPC – Applications & Parallel Programming with GPUs

Updated for 2015-08-19

Page 2: CUDA on BioHPC

Why are CUDA and GPUs Useful?


Tesla K20 Speed-Up over Sandy Bridge CPUs

CPU results: dual-socket E5-2687W, 3.10 GHz; GPU results: dual-socket E5-2687W + 2 Tesla K20X GPUs

*MATLAB results compare one i7-2600K CPU vs. one Tesla K20 GPU

Disclaimer: Non-NVIDIA implementations may not have been fully optimized

[Bar chart, 0x–20x axis: Tesla K20 speed-ups for AMBER (Molecular Dynamics), SPECFEM3D (Earth Science), Chroma (Physics), and MATLAB FFT (Engineering)]

© NVIDIA 2013

Page 3: CUDA on BioHPC

Molecular Dynamics (GROMACS / AMBER…)


Folding@Home

Most famous distributed HPC effort

Uses GROMACS core to run simulations

122,784 active CPUs: 472 TFLOPS

47,226 active GPUs: 14,183 TFLOPS

On average, each GPU is as good as 78 CPUs

(Total GPU capacity equivalent to ~6500 of our GPU nodes!!)

Page 4: CUDA on BioHPC

DNA Sequence Alignment

MUMmerGPU 2.0

Align 1,000,000 simulated 150 bp reads to the E. coli K-12 genome sequence.

CPU: 90.6 s    K20 GPU: 20.09 s

4.5x speedup

Page 5: CUDA on BioHPC

Phylogenetics

Ribosomal proteins: 48 taxa, 11,949 characters, 8 chains

CPU: 389 s    K20 GPU: 66 s

5.9x speedup

MrBayes 3.2.5 (CPU, single-node parallel) vs. GPU MrBayes 3.1.2 (K20 GPU)

Analysis time for 8,000 generations

Page 6: CUDA on BioHPC

GROMACS on BioHPC

Lysozyme-in-water tutorial

CPU: 36.6 ns/day    K20 GPU: 57.48 ns/day

1.6x speedup

(A very simple system; the speed-up grows for more complex systems, up to ~3-4x.)

Page 7: CUDA on BioHPC

What are GPUs? What is CUDA?

Massively parallel code, running on a Graphics Processing Unit

Page 8: CUDA on BioHPC

What are GPUs? What is CUDA?

[Diagram: the application code is split between CPU and GPU – compute-intensive functions are parallelized on the GPU, while the rest of the sequential code runs on the CPU]

© NVIDIA 2013

Page 9: CUDA on BioHPC

CUDA on BioHPC - Hardware

Nucleus042-43: Tesla K40 GPUs

Nucleus044-49: Tesla K20 GPUs

Page 10: CUDA on BioHPC

CUDA on BioHPC - Hardware

BioHPC workstations (Dell Precision towers)

NVIDIA Quadro 600 / K600 / K620

1 / 1 / 2 GB of GPU RAM (shared with the display)

96 / 192 / 384 CUDA cores

Good for developing code. Not always faster than CPU code on these machines (especially against CPU code built with the Intel compiler and MKL).

Page 11: CUDA on BioHPC

How to use CUDA – three routes to accelerating applications:

• Libraries: "drop-in" acceleration

• OpenACC directives: easily accelerate applications

• Programming languages (CUDA C/C++/Fortran): maximum flexibility

Page 12: CUDA on BioHPC

GPU Software – BioHPC Portal


List of GPU-supporting software will be added to the BioHPC portal soon!

Page 13: CUDA on BioHPC

CUDA on BioHPC - Software

module load cuda65
NVIDIA CUDA toolkit – for writing and building CUDA C/C++/Fortran
Libraries: cuBLAS, Thrust, etc.

module load cuda65/nsight
Nsight – CUDA debugging / profiling

Also various software is available with GPU support:

pycuda in python/xxx-anaconda

gputools in R/3.2.1-intel

Parallel Computing Toolbox in MATLAB

GPU support in applications – e.g. GROMACS, various SBGrid tools, etc.
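For example, building and running one of the worked examples from the following slides might look like this (a minimal sketch: the file name matches the later slides, and the CPU OpenMP code additionally needs -Xcompiler -fopenmp passed through to the host compiler):

module load cuda65
nvcc -O2 -o 1_naive_mmult 1_naive_mmult.cu
./1_naive_mmult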

Page 14: CUDA on BioHPC

GPU-Accelerated Libraries

NVIDIA cuBLAS, cuRAND, cuSPARSE, cuFFT and NPP; Thrust (C++ STL features for CUDA); IMSL Library; ArrayFire matrix computations; plus libraries for vector signal/image processing, GPU-accelerated dense and sparse linear algebra, matrix algebra on GPU and multicore, and building-block algorithms for CUDA.

Page 15: CUDA on BioHPC

Programming Languages


Page 16: CUDA on BioHPC

Worked Example – Matrix Multiplication

Let's multiply matrices with:

• CPU & GPU naively
• GPU with some optimization
• CPU & GPU with MKL / cuBLAS

Which will be fastest?!

Why is matrix multiplication well suited to GPUs?

Page 17: CUDA on BioHPC

CPU vs GPU – What’s Different?

CPU: few fast cores, very clever.

GPU: many slower cores, not as smart, but faster overall if you are clever!

Page 18: CUDA on BioHPC

Running on the GPU – Load data into GPU RAM

1. Copy input data from CPU memory to GPU memory (over the PCI bus)

© NVIDIA 2013

Page 19: CUDA on BioHPC

Running on the GPU – Parallel Code Execution

1. Copy input data from CPU memory to GPU memory (over the PCI bus)

2. Load the GPU program and execute, manually caching data in shared memory for performance

© NVIDIA 2013

Page 20: CUDA on BioHPC

Running on the GPU – Retrieving the Results

1. Copy input data from CPU memory to GPU memory (over the PCI bus)

2. Load the GPU program and execute, manually caching data in shared memory for performance

3. Copy the results from GPU memory back to CPU memory

© NVIDIA 2013
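In CUDA C these three steps map onto a handful of API calls. A minimal sketch (the kernel and sizes here are illustrative, not part of the matrix example that follows):

#include <cstdlib>

// Illustrative kernel: double every element of an array
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) x[i] *= 2.0f;                         // guard against out-of-range threads
}

int main() {
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_x = (float *)calloc(n, sizeof(float));  // host buffer
    float *d_x; cudaMalloc(&d_x, bytes);             // device buffer

    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  // 1. copy input to the GPU
    scale<<<(n + 255) / 256, 256>>>(d_x, n);               // 2. launch the kernel
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);  // 3. copy the results back

    cudaFree(d_x); free(h_x);
    return 0;
}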

Page 21: CUDA on BioHPC

Matrix Multiplication


Each output value can be independently computed.

We can chop the problem up into manageable pieces to run in parallel.

How do we do this on the GPU?

Page 22: CUDA on BioHPC

Grids, Blocks, Threads, Warps, Stream Processors – Oh my!


A problem needs the same calculation(s) performed N times on different data.

The calculation needed for a single output is implemented as a kernel.

The entire problem is represented in a 1/2/3D grid structure.

The grid is split up into smaller blocks, to fit the architecture of the GPU

Page 23: CUDA on BioHPC

Grids, Blocks, Threads, Warps, Stream Processors – Oh my!


Inside each block the calculation on each piece of data will be performed by a separate thread executing the kernel.

Threads in a block are run in groups called warps.

All threads in a warp run the same instruction at the same time, in parallel.

Blocks and their warps are scheduled across multiple stream processors.
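On NVIDIA GPUs a warp is 32 threads, so the 16x16 = 256-thread blocks used in the examples below split into 8 full warps. Inside a kernel, each thread works out which output element it owns from its block and thread indices; a minimal sketch of the pattern the following kernels use:

int i = blockIdx.y * blockDim.y + threadIdx.y;   // global row index of this thread
int j = blockIdx.x * blockDim.x + threadIdx.x;   // global column index of this thread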

Page 24: CUDA on BioHPC

CPU Matrix Multiplication (Naïve with OpenMP)


void cpu_naive_mmul(const float *A, const float *B, float *C,
                    const int M, const int K, const int N) {
    // A is M x K, B is K x N, C is M x N, all stored row-major
    #pragma omp parallel for
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            float sum = 0.0;
            for (int k = 0; k < K; k++) {
                sum += A[(i * K) + k] * B[(k * N) + j];
            }
            C[(i * N) + j] = sum;
        }
    }
}

1_naive_mmult.cu

Page 25: CUDA on BioHPC

A GPU Matrix Multiplication Kernel (Naïve)


__global__ void gpu_naive_mmul(const float *A, const float *B, float *C,
                               const int M, const int K, const int N) {
    // Each thread computes one element C[i][j] of the M x N output (row-major)
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;

    if ((i >= M) || (j >= N)) return;

    float sum = 0.0;
    for (int k = 0; k < K; ++k) {
        sum += A[(i * K) + k] * B[(k * N) + j];
    }

    C[(i * N) + j] = sum;
}

1_naive_mmult.cu

Page 26: CUDA on BioHPC

Initializing some matrices & copying to the GPU


// Allocate matrices on the host
float *h_A = (float *)malloc(nr_rows_A * nr_cols_A * sizeof(float));
float *h_B = (float *)malloc(nr_rows_B * nr_cols_B * sizeof(float));
float *h_C = (float *)malloc(nr_rows_C * nr_cols_C * sizeof(float));

// Fill the input matrices A and B with numbers
init_const(h_A, nr_cols_A * nr_rows_A);
init_const(h_B, nr_cols_B * nr_rows_B);

// Allocate matrices on the GPU
float *d_A, *d_B, *d_C;
cudaMalloc(&d_A, nr_rows_A * nr_cols_A * sizeof(float));
cudaMalloc(&d_B, nr_rows_B * nr_cols_B * sizeof(float));
cudaMalloc(&d_C, nr_rows_C * nr_cols_C * sizeof(float));

// Copy the matrices to the GPU
cudaMemcpy(d_A, h_A, nr_rows_A * nr_cols_A * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, nr_rows_B * nr_cols_B * sizeof(float), cudaMemcpyHostToDevice);

1_naive_mmult.cu

Page 27: CUDA on BioHPC

Running the Kernel


// Set up the size of our blocks and grid
dim3 dimBlock(16, 16);
dim3 dimGrid(nr_cols_B / dimBlock.x, nr_rows_A / dimBlock.y);

// Perform the actual multiplication on the GPU with the naive kernel
gpu_naive_mmul<<<dimGrid, dimBlock>>>(d_A, d_B, d_C,
                                      nr_rows_A, nr_cols_A, nr_cols_B);

// Wait for the GPU to finish
cudaDeviceSynchronize();

1_naive_mmult.cu
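The integer divisions above only cover the whole output when the matrix dimensions are exact multiples of the block size, as with 4096 and 16 here. A sketch of the more general launch, which rounds the grid up and relies on the (i >= M) || (j >= N) guard in the kernel to mask the excess threads:

dim3 dimBlock(16, 16);
// Round up so that partially filled blocks cover the edges of the output matrix
dim3 dimGrid((nr_cols_B + dimBlock.x - 1) / dimBlock.x,
             (nr_rows_A + dimBlock.y - 1) / dimBlock.y);
gpu_naive_mmul<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, nr_rows_A, nr_cols_A, nr_cols_B);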

Page 28: CUDA on BioHPC

Getting the result, and cleaning up


// Copy the result from the device to the host
cudaMemcpy(h_C, d_C, nr_rows_C * nr_cols_C * sizeof(float), cudaMemcpyDeviceToHost);

// Free GPU memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);

1_naive_mmult.cu

Page 29: CUDA on BioHPC

How Fast?


Multiplying 4096x4096 square matrices of floating point values.

16,777,216 output values, each requiring 4,096 multiplications and 4,096 additions.

~137 GFLOP in total (137 billion floating-point operations).

CPU routine: 6.5 s ≈ 21 GFLOP/s

GPU routine: 1.7 s ≈ 82 GFLOP/s, plus 0.056 s for data transfer

Each floating-point value is 4 bytes, so memory usage is 3 × 4096 × 4096 × 4 bytes ≈ 192 MB.

Page 30: CUDA on BioHPC

Speeding up the GPU Kernel

Shared memory (per block): fast access, KB of space (48 KB on our GPUs)

Global memory: slower access, GB of space (6/12 GB on our GPUs)

Page 31: CUDA on BioHPC

Speeding up the GPU Kernel

Each block generates part of the output matrix, so it doesn't need all of the input matrices.

Put just the pieces it needs into fast shared memory!

Page 32: CUDA on BioHPC

Tiled Matrix Multiplication Kernel


__global__ void gpu_tiled_mmul(const float *A, const float *B, float *C,
                               const int wA, const int wB) {
    // Block index & thread index
    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;
    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + wA - 1;
    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;
    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;
    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;

    // Running total for the single output element this thread computes
    float Csub = 0;

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {

        // Shared memory for tiles from A and B
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the matrices from device memory to shared memory;
        // each thread loads one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

2_tiled_mmult.cu

Page 33: CUDA on BioHPC

Tiled Matrix Multiplication Kernel


        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element of the block sub-matrix
#pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k) {
            Csub += As[ty][k] * Bs[k][tx];
        }

        // Synchronize to make sure that the preceding computation is done
        // before loading two new sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;

}

2_tiled_mmult.cu
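The tiled kernel assumes BLOCK_SIZE is a compile-time constant matching the block dimensions used at launch; for the 16x16 blocks used in this example that would be, e.g.:

#define BLOCK_SIZE 16   // tile width, matching dim3 dimBlock(16,16)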

Page 34: CUDA on BioHPC

How Much Faster?

Naïve GPU: 1.7 s ≈ 82 GFLOP/s

Tiled GPU: 0.673 s ≈ 204 GFLOP/s

             Copy to GPU   Execute   Copy back   (milliseconds)
GPU Naïve         36         1653        20
GPU Tiled         37          673        20

2_tiled_mmult.cu

Page 35: CUDA on BioHPC

What About Libraries?


BLAS is a widely used library for linear algebra

Intel MKL is an optimized implementation for Intel CPUs

MKL BLAS 0.464s ~ 295 GFLOP/s

// Intel MKL headers (for MKL BLAS)
#include <mkl.h>

void cpu_blas_mmul(const float *A, const float *B, float *C,
                   const int m, const int k, const int n) {
    // C (m x n) = A (m x k) * B (k x n), row-major storage
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, A, k, B, n, 0.0, C, n);
}

3_blas_mmult.cu

Page 36: CUDA on BioHPC

cuBLAS


cuBLAS 0.062s ~ 2.2 TFLOP/s

// cuBLAS header
#include <cublas_v2.h>

void gpu_blas_mmul(const float *A, const float *B, float *C,
                   const int m, const int k, const int n) {
    const float alf = 1;
    const float bet = 0;
    const float *alpha = &alf;
    const float *beta  = &bet;

    // Create a handle for cuBLAS
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Do the actual multiplication
    // (cuBLAS assumes column-major storage, hence the leading dimensions m, k, m)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                alpha, A, m, B, k, beta, C, m);

    // Destroy the handle
    cublasDestroy(handle);
}

3_blas_mmult.cu
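Note that cuBLAS is a separate library, so the program must be linked against it (with nvcc, add -lcublas to the build command).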

Page 37: CUDA on BioHPC

Comparing All the Methods…

4096x4096 square matrix multiplication, total time in milliseconds:

CPU Naïve (OpenMP):  6491
CPU MKL BLAS:         464
GPU Naïve:           1653
GPU Tiled:            673
GPU cuBLAS:            62

~7.5x speedup of cuBLAS over the best CPU result (MKL BLAS)

Page 38: CUDA on BioHPC

Thrust – GPU Implementations of Standard Algorithms


#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <cstdlib>   // rand
#include <iostream>

...

// Generate random integers on the host
thrust::host_vector<int> h_vec(200000);
thrust::generate(h_vec.begin(), h_vec.end(), rand);

// Transfer to the device and sort
thrust::device_vector<int> d_vec = h_vec;
thrust::sort(d_vec.begin(), d_vec.end());

// Copy back and show a result
thrust::host_vector<int> h_result = d_vec;
std::cerr << "third item in sorted data: " << h_result[2] << std::endl;
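Thrust is a header-only template library bundled with the CUDA toolkit, so the example above needs no extra modules or linker flags – nvcc builds it as-is.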

Page 39: CUDA on BioHPC

When should I use CUDA?


1) Amount of computation must be >> amount of memory transfer

2) Can represent problem as a 1/2/3D grid of independent blocks and threads

3) There are *many* independent calculations. The K20 has 2,496 cores and we need to fill them up: you must have threads >> cores for good efficiency (see the arithmetic after this list).

4) The problem fits inside the smaller RAM on the GPU (6/12 GB vs. 128/256/384 GB), or can be subdivided, but see 1.

5) A library is available for my task, or I am comfortable thinking about computer architectures.
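As a concrete check on point 3: the 4096x4096 multiplication above launches one thread per output element, i.e. 4096 × 4096 ≈ 16.8 million threads, or roughly 6,700 threads per K20 core – plenty to keep the GPU busy.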

Page 40: CUDA on BioHPC

Help!


https://developer.nvidia.com/resources

Fantastic documentation, tutorials, example code, videos, forums…

Page 41: CUDA on BioHPC

Acknowledgements


Various figures and examples from:

MSDN article – NVIDIA GPU Architecture
https://code.msdn.microsoft.com/windowsdesktop/NVIDIA-GPU-Architecture-45c11e6d
Alan Tatourian, Apache License v2.0

NVIDIA CUDA Education Materials
https://developer.nvidia.com/cuda-education
Mark Ebersole, Mark Harris, Thomas Bradley – NVIDIA Corporation