CUDA on BioHPC: Applications & Parallel Programming with GPUs
Updated for 2015-08-19
Why are CUDA and GPUs Useful?
[Figure: Tesla K20 Speed-Up over Sandy Bridge CPUs. Bar chart, axis from 0.0x to 20.0x. Bars: AMBER (Molecular Dynamics), SPECFEM3D (Earth Science), Chroma (Physics), MATLAB (FFT)* (Engineering). © NVIDIA 2013]
CPU results: Dual socket E5-2687w, 3.10 GHz. GPU results: Dual socket E5-2687w + 2 Tesla K20X GPUs
*MATLAB results comparing one i7-2600K CPU vs. Tesla K20 GPU
Disclaimer: Non-NVIDIA implementations may not have been fully optimized
Molecular Dynamics (GROMACS / AMBER…)
Folding@Home
Most famous distributed HPC effort
Uses GROMACS core to run simulations
122,784 active CPUs 472 TFLOPS
47,226 active GPUs 14,183 TFLOPS
On average, each GPU is as good as 78 CPUs
(Total GPU capacity equivalent to ~6500 of our GPU nodes!!)
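The "78 CPUs per GPU" figure follows directly from the totals quoted above. A minimal sketch of that arithmetic, where the function names are hypothetical and the inputs are just the Folding@Home numbers from this slide:

```cpp
#include <cassert>

// Average throughput of one device, in TFLOPS.
double per_device_tflops(double total_tflops, double devices) {
    assert(devices > 0.0);
    return total_tflops / devices;
}

// How many CPUs one GPU is worth on average, using the figures on the slide.
double gpu_to_cpu_ratio() {
    const double cpu = per_device_tflops(472.0, 122784.0);    // roughly 0.0038 TFLOPS per CPU
    const double gpu = per_device_tflops(14183.0, 47226.0);   // roughly 0.30 TFLOPS per GPU
    return gpu / cpu;                                          // roughly 78
}
```

The ratio compares averages over very different hardware, so it is an indication of scale rather than a benchmark.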
DNA Sequence Alignment
MUMmerGPU 2.0
Align 1,000,000 simulated 150 bp reads to the E. coli K12 genome sequence.
CPU: 90.6 s
K20 GPU: 20.09 s
4.5x speedup
Phylogenetics
Ribosomal proteins: 48 taxa, 11,949 characters, 8 chains
Analysis time for 8000 generations:
CPU: 389 s
K20 GPU: 66 s
5.9x speedup
MrBayes 3.2.5 (CPU, single node parallel) vs. GPU MrBayes 3.1.2 (GPU K20)
GROMACS on BioHPC
Lysozyme in water tutorial
CPU: 36.6 ns/day
K20 GPU: 57.48 ns/day
1.6x speedup
(Very simple system. Speed-up will grow for more complex systems, to about 3-4x at most.)
What are GPUs? What is CUDA?
Massively parallel code, running on a Graphics Processing Unit
[Figure: application code split between GPU and CPU. Compute-intensive functions are parallelized on the GPU; the rest of the sequential code runs on the CPU. © NVIDIA 2013]
CUDA on BioHPC - Hardware
Nucleus042-43 Tesla K40
Nucleus044-49 Tesla K20
BioHPC Workstations (Dell Precision Towers)
NVIDIA QUADRO 600/K600/K620
1/1/2 GB of RAM (shared with display)
96/192/384 cores
Good for developing code. Not always faster than CPU code on these machines (especially CPU code built with the Intel compiler and MKL).
How to use CUDA
Applications can be accelerated through:
Libraries: "drop-in" acceleration
OpenACC directives: easily accelerate applications
Programming languages: maximum flexibility
GPU Software – BioHPC Portal
List of GPU-supporting software will be added to the BioHPC portal soon!
CUDA on BioHPC - Software
module load cuda65
    NVIDIA CUDA toolkit
    For writing and building CUDA C/C++/Fortran
    Libraries: cuBLAS, Thrust etc.
module load cuda65/nsight
    CUDA debugging / profiling
Also various software available with GPU support:
pycuda in python/xxx-anaconda
gputools in R/3.2.1-intel
Parallel Computing Toolbox in matlab
GPU support in apps – e.g. gromacs, various SBGRID tools etc.
GPU-Accelerated Libraries
[Figure: logos of GPU-accelerated libraries]
NVIDIA cuBLAS
NVIDIA cuRAND
NVIDIA cuSPARSE
NVIDIA NPP
NVIDIA cuFFT
Vector Signal Image Processing
GPU Accelerated Linear Algebra
Matrix Algebra on GPU and Multicore
C++ STL Features for CUDA
IMSL Library
Building-block Algorithms for CUDA
ArrayFire Matrix Computations
Sparse Linear Algebra
Programming Languages
Worked Example – Matrix Multiplication
Let’s multiply matrices with:
• CPU & GPU naively
• GPU with some optimization
• CPU & GPU with MKL / cuBLAS
Which will be fastest?!
Why is matrix multiplication well suited to GPUs?
CPU vs GPU – What’s Different?
CPU: few fast cores, very clever.
GPU: many slower cores, not as smart, but faster overall if you are clever!
Running on the GPU – Load data into GPU RAM
1. Copy input data from CPU memory to GPU memory, across the PCI bus
Running on the GPU – Parallel Code Execution
2. Load GPU program and execute, manually caching data in shared memory for performance
Running on the GPU – Retrieving the Results
3. Copy results from GPU memory to CPU memory, back across the PCI bus
Matrix Multiplication
Each output value can be independently computed.
We can chop the problem up into manageable pieces to run in parallel.
How do we do this on the GPU?
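The independence is visible in the definition of the product. As a short worked formula, assuming A is M x K, B is K x N and C is M x N (the same names the code in this deck uses):

```latex
C_{ij} = \sum_{k=0}^{K-1} A_{ik} \, B_{kj},
\qquad 0 \le i < M, \quad 0 \le j < N
```

Each C_ij reads one row of A and one column of B and writes nothing that any other C_ij reads, so all M x N sums can in principle be computed at the same time.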
Grids, Blocks, Threads, Warps, Stream Processors – Oh my!
A problem needs the same calculation(s) performed N times on different data.
The calculation needed for a single output is implemented as a kernel.
The entire problem is represented in a 1/2/3D grid structure.
The grid is split up into smaller blocks, to fit the architecture of the GPU
Inside each block the calculation on each piece of data will be performed by a separate thread executing the kernel.
Threads in a block are run in groups of 32 called warps.
All threads in a warp run the same instruction at the same time, in parallel.
Blocks and their warps are scheduled across the GPU's multiple streaming multiprocessors (the "stream processors" in the title).
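The grid, block and thread hierarchy can be tried out on the CPU before touching a GPU. The sketch below is plain C++, not CUDA: `Dim2`, `global_index` and `visit_counts` are hypothetical names that stand in for CUDA's built-in `blockIdx`, `blockDim` and `threadIdx`, and the "threads" simply run one after another in nested loops.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the x/y parts of CUDA's blockIdx, blockDim, threadIdx.
struct Dim2 { int x; int y; };

// Global (column, row) handled by one thread, computed the way a CUDA kernel would.
Dim2 global_index(Dim2 blockIdx, Dim2 blockDim, Dim2 threadIdx) {
    return { blockIdx.x * blockDim.x + threadIdx.x,
             blockIdx.y * blockDim.y + threadIdx.y };
}

// Walk every block of the grid and every thread of each block, sequentially,
// and count how often each element of a rows x cols problem is visited.
std::vector<int> visit_counts(int rows, int cols, Dim2 blockDim) {
    assert(rows > 0 && cols > 0 && blockDim.x > 0 && blockDim.y > 0);
    // Enough blocks to cover the problem, rounding up at the edges.
    const Dim2 gridDim = { (cols + blockDim.x - 1) / blockDim.x,
                           (rows + blockDim.y - 1) / blockDim.y };
    std::vector<int> counts(rows * cols, 0);
    for (int by = 0; by < gridDim.y; ++by)
        for (int bx = 0; bx < gridDim.x; ++bx)
            for (int ty = 0; ty < blockDim.y; ++ty)
                for (int tx = 0; tx < blockDim.x; ++tx) {
                    const Dim2 g = global_index({bx, by}, blockDim, {tx, ty});
                    // Threads that fall outside the problem do nothing.
                    if (g.y >= rows || g.x >= cols) continue;
                    counts[g.y * cols + g.x] += 1;
                }
    return counts;
}
```

Calling `visit_counts(5, 7, {4, 4})` launches a 2 x 2 grid of 4 x 4 blocks over a 5 x 7 problem; every element is visited exactly once and the surplus threads at the edges are skipped by the bounds check.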
CPU Matrix Multiplication (Naïve with OpenMP)
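void cpu_naive_mmul(const float *A, const float *B, float *C,
                    const int M, const int K, const int N) {
    // A is M x K, B is K x N, C is M x N, all stored row-major.
    // Rows of C are independent, so OpenMP shares them between threads.
    #pragma omp parallel for
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < K; k++) {
                // Row length of A is K, row length of B is N
                sum += A[(i * K) + k] * B[(k * N) + j];
            }
            C[(i * N) + j] = sum;  // row length of C is N
        }
    }
}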
1_naive_mmult.cu
A GPU Matrix Multiplication Kernel (Naïve)
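__global__ void gpu_naive_mmul(const float *A, const float *B, float *C,
                               const int M, const int K, const int N) {
    // Each thread computes one element C[i][j] (row-major storage)
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
    // Threads that fall outside the matrix do nothing
    if ((i >= M) || (j >= N)) return;
    float sum = 0.0f;
    for (int k = 0; k < K; ++k) {
        // Row length of A is K, row length of B is N
        sum += A[(i * K) + k] * B[(k * N) + j];
    }
    C[(i * N) + j] = sum;  // row length of C is N
}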
1_naive_mmult.cu
Initializing some matrices & copying to the GPU
// Allocate matrices on the host
float *h_A = (float *)malloc(nr_rows_A * nr_cols_A * sizeof(float));
float *h_B = (float *)malloc(nr_rows_B * nr_cols_B * sizeof(float));
float *h_C = (float *)malloc(nr_rows_C * nr_cols_C * sizeof(float));
// Fill the input matrices A and B with numbers
init_const(h_A, nr_cols_A * nr_rows_A);
init_const(h_B, nr_cols_B * nr_rows_B);
// Allocate matrices on the GPU
float *d_A, *d_B, *d_C;
cudaMalloc(&d_A, nr_rows_A * nr_cols_A * sizeof(float));
cudaMalloc(&d_B, nr_rows_B * nr_cols_B * sizeof(float));
cudaMalloc(&d_C, nr_rows_C * nr_cols_C * sizeof(float));
// Copy the matrices to the GPU
cudaMemcpy(d_A, h_A, nr_rows_A * nr_cols_A * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, nr_rows_B * nr_cols_B * sizeof(float), cudaMemcpyHostToDevice);
1_naive_mmult.cu
Running the Kernel
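// Set up the size of our blocks and grid.
// x covers the columns of C (nr_cols_B), y covers the rows of C (nr_rows_A).
// Round up so sizes that are not a multiple of 16 are still fully covered.
dim3 dimBlock(16, 16);
dim3 dimGrid((nr_cols_B + dimBlock.x - 1) / dimBlock.x,
             (nr_rows_A + dimBlock.y - 1) / dimBlock.y);
// Perform the actual multiplication on the GPU with the naive kernel
gpu_naive_mmul<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, nr_rows_A, nr_cols_A, nr_cols_B);
// Wait for the GPU to finish
cudaDeviceSynchronize();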
1_naive_mmult.cu
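A grid size computed with plain integer division only covers the whole matrix when its dimensions are exact multiples of the block size. A minimal sketch of the usual round-up calculation, where `blocks_for` is a hypothetical helper name:

```cpp
#include <cassert>

// Number of blocks needed to cover n elements with blocks of `block` threads,
// rounding up so that a partly filled block at the edge is still launched.
int blocks_for(int n, int block) {
    assert(n >= 0 && block > 0);
    return (n + block - 1) / block;
}
```

For the 4096 x 4096 case with 16 x 16 blocks this gives 256 blocks per dimension, the same as plain division. For 4100 columns it gives 257, whereas 4100 / 16 truncates to 256 and would leave the last 4 columns uncomputed. The extra threads in the final block are why the kernel needs a bounds check.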
Getting the result, and cleaning up
// Copy from the device to the host
cudaMemcpy(h_C, d_C, nr_rows_C * nr_cols_C * sizeof(float), cudaMemcpyDeviceToHost);
// Free GPU memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
// Free host memory
free(h_A);
free(h_B);
free(h_C);
1_naive_mmult.cu
How Fast?
Multiplying 4096x4096 square matrices of floating point values.
16,777,216 output values, each requiring 4,096 multiplications and 4,096 additions.
~137 GFLOP in total (GFLOP = billion floating point operations).
CPU routine takes 6.5 s, ~21 GFLOP/s
GPU routine takes 1.7 s, ~82 GFLOP/s, plus 0.056 s for data transfer
Each floating point value is 4 bytes.
Memory usage is 3 * 4096 * 4096 * 4 bytes = 192 MB
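The operation count, throughput and memory figures above can be reproduced with a few lines of arithmetic. This is a sketch with hypothetical helper names; it counts one multiplication and one addition per term of each dot product, as the slide does.

```cpp
#include <cassert>
#include <cstdint>

// Floating point operations for C (m x n) = A (m x k) * B (k x n):
// one multiply and one add for each of the k terms of each of the m*n outputs.
std::uint64_t matmul_flop(std::uint64_t m, std::uint64_t k, std::uint64_t n) {
    return 2 * m * k * n;
}

// Sustained rate in GFLOP/s for a run that took `seconds`.
double gflops(std::uint64_t flop, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(flop) / seconds / 1e9;
}

// Bytes needed to hold one rows x cols single-precision matrix.
std::uint64_t matrix_bytes(std::uint64_t rows, std::uint64_t cols) {
    return rows * cols * sizeof(float);
}
```

With m = k = n = 4096, `matmul_flop` returns 137,438,953,472 (about 137 GFLOP), `gflops` of that over 6.5 s is about 21, and three matrices occupy 3 * `matrix_bytes(4096, 4096)` = 201,326,592 bytes, which is 192 MB when 1 MB is taken as 1024 * 1024 bytes.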
Speeding up the GPU Kernel
Shared memory: fast access, KB of space (48 KB on our GPUs)
Global memory: slow access, GB of space (6/12 GB on our GPUs)
Each block generates some of the output matrix
Doesn’t need all of the input matrices.
Put the bits it needs into fast shared memory!
Tiled Matrix Multiplication Kernel
// BLOCK_SIZE is a compile-time constant (the tile width) defined elsewhere in the file.
// The kernel has no bounds checks, so matrix dimensions must be multiples of BLOCK_SIZE
// and each block must be launched with BLOCK_SIZE x BLOCK_SIZE threads.
__global__ void gpu_tiled_mmul(const float *A, const float *B, float *C,
                               const int wA, const int wB) {
    // Block index & thread index
    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;
    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + wA - 1;
    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;
    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;
    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;
    // Csub accumulates the element of the block sub-matrix computed by this thread
    float Csub = 0.0f;
    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
        // Shared memory for tiles from A and B
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        // Load the matrices from device memory to shared memory;
        // each thread loads one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];
        // Synchronize to make sure the matrices are loaded
        __syncthreads();
        // Multiply the two matrices together;
        // each thread computes one element of the block sub-matrix
        #pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k) {
            Csub += As[ty][k] * Bs[k][tx];
        }
        // Synchronize to make sure that the preceding computation is done
        // before loading two new sub-matrices of A and B in the next iteration
        __syncthreads();
    }
    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}
2_tiled_mmult.cu
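The tiling idea can be checked on the CPU without a GPU. The sketch below is plain C++, not the kernel itself: the small local arrays `As` and `Bs` play the role of shared memory, the three outer loops stand in for the blocks and the tile steps, and `TILE = 4` is an arbitrary value chosen for illustration. Unlike the kernel above it zero-pads partial tiles, so the dimensions need not be multiples of the tile width.

```cpp
#include <cassert>
#include <vector>

// Tile width; a hypothetical value chosen for illustration.
constexpr int TILE = 4;

// Reference: C (M x N) = A (M x K) * B (K x N), row-major, one output at a time.
std::vector<float> naive_mmul(const std::vector<float>& A, const std::vector<float>& B,
                              int M, int K, int N) {
    assert(static_cast<int>(A.size()) == M * K && static_cast<int>(B.size()) == K * N);
    std::vector<float> C(M * N, 0.0f);
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < K; ++k) sum += A[i * K + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
    return C;
}

// Tiled version: work on one TILE x TILE block of C at a time and, for each
// step along K, first copy the needed tiles of A and B into small local arrays.
std::vector<float> tiled_mmul(const std::vector<float>& A, const std::vector<float>& B,
                              int M, int K, int N) {
    assert(static_cast<int>(A.size()) == M * K && static_cast<int>(B.size()) == K * N);
    std::vector<float> C(M * N, 0.0f);
    for (int i0 = 0; i0 < M; i0 += TILE)              // tile row of C (like blockIdx.y)
        for (int j0 = 0; j0 < N; j0 += TILE)          // tile column of C (like blockIdx.x)
            for (int k0 = 0; k0 < K; k0 += TILE) {    // step along the shared dimension
                float As[TILE][TILE] = {};            // zero-filled, so partial tiles are padded
                float Bs[TILE][TILE] = {};
                for (int ty = 0; ty < TILE; ++ty)
                    for (int tx = 0; tx < TILE; ++tx) {
                        if (i0 + ty < M && k0 + tx < K) As[ty][tx] = A[(i0 + ty) * K + k0 + tx];
                        if (k0 + ty < K && j0 + tx < N) Bs[ty][tx] = B[(k0 + ty) * N + j0 + tx];
                    }
                for (int ty = 0; ty < TILE; ++ty)
                    for (int tx = 0; tx < TILE; ++tx) {
                        if (i0 + ty >= M || j0 + tx >= N) continue;
                        float partial = 0.0f;
                        for (int k = 0; k < TILE; ++k) partial += As[ty][k] * Bs[k][tx];
                        C[(i0 + ty) * N + j0 + tx] += partial;
                    }
            }
    return C;
}
```

Each element loaded into a tile is reused TILE times before it is discarded, which is the reason the GPU version needs far fewer reads from slow memory than the naive kernel.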
How Much Faster?
Naïve GPU 1.7s ~ 82 GFLOP/s
Tiled GPU 0.673s ~ 204 GFLOP/s

Milliseconds    Copy To    Execute    Copy Back
GPU Naive       36         1653       20
GPU Tiled       37         673        20
2_tiled_mmult.cu
What About Libraries?
BLAS is a widely used library for linear algebra
Intel MKL is an optimized implementation for Intel CPUs
MKL BLAS 0.464s ~ 295 GFLOP/s
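// Intel MKL headers (for MKL BLAS)
#include <mkl.h>

void cpu_blas_mmul(const float *A, const float *B, float *C,
                   const int m, const int k, const int n) {
    // C (m x n) = 1.0 * A (m x k) * B (k x n) + 0.0 * C, row-major storage.
    // In row-major layout the leading dimension is the row length:
    // k for A, n for B and n for C.
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0f, A, k, B, n, 0.0f, C, n);
}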
3_blas_mmult.cu
cuBLAS
cuBLAS 0.062s ~ 2.2 TFLOP/s
// cuBLAS header
#include <cublas_v2.h>

// A, B and C are device pointers. cuBLAS assumes column-major storage,
// so the leading dimensions are the row counts: m for A, k for B, m for C.
void gpu_blas_mmul(const float *A, const float *B, float *C,
                   const int m, const int k, const int n) {
    const float alf = 1;
    const float bet = 0;
    const float *alpha = &alf;
    const float *beta = &bet;
    // Create a handle for cuBLAS
    cublasHandle_t handle;
    cublasCreate(&handle);
    // Do the actual multiplication
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                alpha, A, m, B, k, beta, C, m);
    // Destroy the handle
    cublasDestroy(handle);
}
3_blas_mmult.cu
Comparing All the Methods……
[Figure: bar chart, 4096x4096 square matrix multiplication, time in milliseconds]
CPU Naive OMP    6491
CPU MKL BLAS     464
GPU Naive        1653
GPU Tiled        673
GPU cuBLAS       62
~7.5x speedup (GPU cuBLAS vs. CPU MKL BLAS)
Thrust – GPU Implementations of Standard Algorithms
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <cstdlib>   // rand
#include <iostream>  // std::cerr
...
thrust::host_vector<int> h_vec(200000);
thrust::generate(h_vec.begin(), h_vec.end(), rand);
// transfer to device and sort
thrust::device_vector<int> d_vec = h_vec;
thrust::sort(d_vec.begin(), d_vec.end());
// copy the sorted data back to the host and show a result
thrust::host_vector<int> h_result = d_vec;
std::cerr << "third item in sorted data: " << h_result[2] << std::endl;
When should I use CUDA?
1) Amount of computation must be >> amount of memory transfer
2) Can represent problem as a 1/2/3D grid of independent blocks and threads
3) There are *many* independent calculations
   K20 has 2,496 cores: need to fill them up
   Must have threads >> cores for good efficiency
4) Problem fits inside the smaller RAM on GPU: 6/12 GB vs 128/256/384 GB
   or subdivide the problem, but see 1
5) A library is available for my task, or I am comfortable thinking about computer architectures.
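Point 1 can be made concrete by comparing the floating point operations performed with the bytes that must cross the PCI bus. The sketch below uses hypothetical function names and assumes single-precision data, with the two inputs copied to the GPU and the one result copied back. Vector addition is not part of this deck; it is included only as a contrasting case.

```cpp
#include <cassert>

// FLOP per byte transferred for an n x n matrix multiplication:
// 2*n^3 operations, three n x n float matrices moved (A and B in, C out).
double matmul_flop_per_byte(double n) {
    assert(n > 0.0);
    const double flop = 2.0 * n * n * n;
    const double bytes = 3.0 * n * n * sizeof(float);
    return flop / bytes;   // simplifies to n / 6
}

// FLOP per byte transferred for adding two vectors of length n:
// n additions, three float vectors moved (two in, one out).
double vecadd_flop_per_byte(double n) {
    assert(n > 0.0);
    return n / (3.0 * n * sizeof(float));   // always 1 / 12
}
```

For n = 4096 the matrix multiplication does roughly 683 operations per byte transferred, and the ratio grows with n. Vector addition stays at 1/12 operation per byte however large the vectors are, so on its own it is unlikely to repay the cost of the transfer.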
Help!
https://developer.nvidia.com/resources
Fantastic documentation, tutorials, example code, videos, forums…
Acknowledgements
Various figures and examples from:
MSDN Article: NVIDIA-GPU-Architecture
https://code.msdn.microsoft.com/windowsdesktop/NVIDIA-GPU-Architecture-45c11e6d
Alan Tatourian
Apache License v2.0
NVIDIA CUDA Education Materials
https://developer.nvidia.com/cuda-education
Mark Ebersole, Mark Harris, Thomas Bradley, NVIDIA Corporation