
Page 1: GPU Parallel Computing

© NVIDIA Corporation 2013

GPU Parallel Computing

Zehuan Wang

HPC Developer Technology Engineer

Page 2: GPU Parallel Computing

© NVIDIA Corporation 2013

Access The Power of GPU

Applications

Libraries

OpenACC Directives

Programming Languages

Page 3: GPU Parallel Computing

© NVIDIA Corporation 2013

GPU Accelerated Libraries: "Drop-in" Acceleration for your Applications

NVIDIA cuFFT
NVIDIA cuSPARSE
NVIDIA cuBLAS
NVIDIA cuRAND
NVIDIA NPP (Vector Signal Image Processing)
Matrix Algebra on GPU and Multicore
GPU Accelerated Linear Algebra
C++ Templated Parallel Algorithms
Building-block Algorithms
IMSL Library
CenterSpace NMath

Page 4: GPU Parallel Computing

© NVIDIA Corporation 2013

GPU Programming Languages

Fortran: OpenACC, CUDA Fortran
C: OpenACC, CUDA C
C++: CUDA C++, Thrust, Hemi, ArrayFire
Python: Anaconda Accelerate, PyCUDA, Copperhead
.NET: CUDAfy.NET, Alea.cuBase
Numerical analytics: MATLAB, Mathematica, LabVIEW

developer.nvidia.com/language-solutions

Page 5: GPU Parallel Computing

© NVIDIA Corporation 2013

GPU Architecture

Page 6: GPU Parallel Computing

© NVIDIA Corporation 2013

GPU: Massively Parallel Coprocessor

A GPU is:
- A coprocessor to the CPU (host)
- Has its own DRAM
- Runs 1000s of threads in parallel
- Single precision: 4.58 TFlop/s
- Double precision: 1.31 TFlop/s

Page 7: GPU Parallel Computing

© NVIDIA Corporation 2013

Heterogeneous Parallel Computing

CPU: Latency-Optimized, Fast Serial Processing

[Diagram: CPU chip area devoted to Logic vs. Compute]

Page 8: GPU Parallel Computing

© NVIDIA Corporation 2013

Heterogeneous Parallel Computing

CPU: Latency-Optimized, Fast Serial Processing
GPU: Throughput-Optimized, Fast Parallel Processing

[Diagram: CPU and GPU chip area devoted to Logic vs. Compute]

Page 9: GPU Parallel Computing

© NVIDIA Corporation 2013

Heterogeneous Parallel Computing

CPU: Latency-Optimized, Fast Serial Processing
GPU: Throughput-Optimized, Fast Parallel Processing

[Diagram: CPU and GPU chip area devoted to Logic vs. Compute]

Page 10: GPU Parallel Computing

© NVIDIA Corporation 2013

GPU in Computer System

Connected to the CPU chipset by PCIe
- 16 GB/s each way, 32 GB/s in both directions combined

[Diagram: GPU with its own DRAM, connected over PCIe to the CPU and its DDR3 system DRAM]

Page 11: GPU Parallel Computing

© NVIDIA Corporation 2013

GPU High Level View

Streaming Multiprocessor (SM): a set of CUDA cores

Global memory

Page 12: GPU Parallel Computing

© NVIDIA Corporation 2013

GK110 SM

Control unit
- 4 warp schedulers
- 8 instruction dispatchers

Execution units
- 192 single-precision CUDA cores
- 64 double-precision CUDA cores
- 32 SFUs, 32 LD/ST units

Memory
- Registers: 64K 32-bit
- Caches: L1 + shared memory (64 KB), texture, constant

Page 13: GPU Parallel Computing

© NVIDIA Corporation 2013

Kepler/Fermi Memory Hierarchy

3 levels, very similar to the CPU

Registers
- Spill to local memory

Caches
- Shared memory
- L1 cache
- L2 cache
- Constant cache
- Texture cache

Global memory

Page 14: GPU Parallel Computing

© NVIDIA Corporation 2013

Kepler/Fermi Memory Hierarchy

[Diagram: each SM (SM-0 … SM-N) has its own Registers, L1 & shared memory, texture cache (TEX), and constant cache (C); all SMs share the L2 cache, which sits in front of Global Memory]

Page 15: GPU Parallel Computing

© NVIDIA Corporation 2013

Basic Concepts

GPU computing is all about 2 things:
- Transfer data between CPU and GPU across the PCI bus
- Do parallel computing on the GPU

[Diagram: CPU with CPU Memory and GPU with GPU Memory, connected by the PCI bus; data is transferred across the bus and computation is offloaded to the GPU]

Page 16: GPU Parallel Computing

© NVIDIA Corporation 2013

GPU Programming Basics

Page 17: GPU Parallel Computing

© NVIDIA Corporation 2013

How to Get Started

CUDA C/C++: download the CUDA drivers, compilers & samples (all-in-one package) for free from:

http://developer.nvidia.com/cuda/cuda-downloads

CUDA Fortran: PGI

OpenACC: PGI, CAPS, Cray

Page 18: GPU Parallel Computing

© NVIDIA Corporation 2013

CUDA Programming Basics

Hello World
- Basic syntax, compile & run

GPU memory management
- malloc/free
- memcpy

Writing parallel kernels
- Threads & blocks
- Memory hierarchy

Page 19: GPU Parallel Computing

© NVIDIA Corporation 2013

Heterogeneous Computing

Executes on both CPU & GPU
- Similar to OpenMP's fork-join pattern

Accelerated kernels
- CUDA: simple extensions to C/C++

[Diagram: C program with sequential execution. Serial code runs on the Host, then parallel kernel Kernel0<<<>>>() runs on the Device as Grid 0 (blocks (0,0) through (2,1)), then more serial code on the Host, then parallel kernel Kernel1<<<>>>() runs on the Device as Grid 1 (blocks (0,0) through (1,2))]

Page 20: GPU Parallel Computing

© NVIDIA Corporation 2013

Hello World on CPU

hello_world.c:

#include <stdio.h>

void hello_world_kernel()
{
    printf("Hello World\n");
}

int main()
{
    hello_world_kernel();
}

Compile & Run:
gcc hello_world.c
./a.out

Page 21: GPU Parallel Computing

© NVIDIA Corporation 2013

Hello World on GPU

hello_world.cu:

#include <stdio.h>

__global__ void hello_world_kernel()
{
    printf("Hello World\n");
}

int main()
{
    hello_world_kernel<<<1,1>>>();
    cudaDeviceSynchronize(); // wait for the kernel so the device-side printf is flushed
}

Compile & Run:
nvcc hello_world.cu
./a.out

Page 22: GPU Parallel Computing

© NVIDIA Corporation 2013

Hello World on GPU

CUDA kernels live in .cu files
.cu files are compiled by nvcc
CUDA kernels are preceded by "__global__"
CUDA kernels are launched with "<<<..., ...>>>"

hello_world.cu:

#include <stdio.h>

__global__ void hello_world_kernel()
{
    printf("Hello World\n");
}

int main()
{
    hello_world_kernel<<<1,1>>>();
    cudaDeviceSynchronize(); // wait for the kernel so the device-side printf is flushed
}

Compile & Run:
nvcc hello_world.cu
./a.out

Page 23: GPU Parallel Computing

© NVIDIA Corporation 2013

Memory Spaces

CPU and GPU have separate memory spaces
- Data is moved across the PCIe bus

Use functions to allocate/set/copy memory on the GPU
- Very similar to the corresponding C functions

Page 24: GPU Parallel Computing

© NVIDIA Corporation 2013

CUDA C/C++ Memory Allocation / Release

Host (CPU) manages device (GPU) memory:
  cudaMalloc(void** pointer, size_t nbytes)
  cudaMemset(void* pointer, int value, size_t count)
  cudaFree(void* pointer)

int nbytes = 1024*sizeof(int);
int *d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );
cudaMemset( d_a, 0, nbytes );
cudaFree( d_a );

Page 25: GPU Parallel Computing

© NVIDIA Corporation 2013

Data Copies

cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
- Returns after the copy is complete
- Blocks the CPU thread until all bytes have been copied
- Doesn't start copying until previous CUDA calls complete

enum cudaMemcpyKind:
- cudaMemcpyHostToDevice
- cudaMemcpyDeviceToHost
- cudaMemcpyDeviceToDevice

Non-blocking memcopies are also provided.
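For comparison, a minimal sketch (not from the original slides) of a non-blocking copy using cudaMemcpyAsync and a CUDA stream; it assumes pinned (page-locked) host memory, which is required for the copy to actually overlap with CPU work:

int nbytes = 1024 * sizeof(int);
int *h_a = 0, *d_a = 0;
cudaMallocHost((void**)&h_a, nbytes);   // pinned host memory
cudaMalloc((void**)&d_a, nbytes);

cudaStream_t stream;
cudaStreamCreate(&stream);

// Returns immediately; the copy runs in the background on "stream"
cudaMemcpyAsync(d_a, h_a, nbytes, cudaMemcpyHostToDevice, stream);

// ... independent CPU work can run here, overlapping the transfer ...

cudaStreamSynchronize(stream);          // wait until the copy has finished

cudaStreamDestroy(stream);
cudaFree(d_a);
cudaFreeHost(h_a);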

Page 26: GPU Parallel Computing

© NVIDIA Corporation 2013

Code Walkthrough 1

1. Allocate CPU memory for n integers
2. Allocate GPU memory for n integers
3. Initialize GPU memory to 0s
4. Copy from GPU to CPU
5. Print the values

Page 27: GPU Parallel Computing

© NVIDIA Corporation 2013

Code Walkthrough 1

#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

Page 28: GPU Parallel Computing

© NVIDIA Corporation 2013

Code Walkthrough 1

#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

Page 29: GPU Parallel Computing

© NVIDIA Corporation 2013

Code Walkthrough 1

#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    cudaMemset( d_a, 0, num_bytes );
    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

Page 30: GPU Parallel Computing

© NVIDIA Corporation 2013

Code Walkthrough 1

#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    cudaMemset( d_a, 0, num_bytes );
    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int i=0; i<dimx; i++)
        printf("%d ", h_a[i] );
    printf("\n");

    free( h_a );
    cudaFree( d_a );

    return 0;
}

Page 31: GPU Parallel Computing

© NVIDIA Corporation 2013

Compile & Run

nvcc main.cu

./a.out
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Page 32: GPU Parallel Computing

© NVIDIA Corporation 2013

Thread Hierarchy

2-level hierarchy: blocks and grid
- Block = a group of up to 1024 threads
- Grid = all blocks for a given kernel launch
- E.g. 72 threads total: blockDim = 12, gridDim = 6

Threads in a block can:
- Synchronize their execution
- Communicate via shared memory

The sizes of the grid and blocks are specified at kernel launch:

dim3 grid(6,1,1), block(12,1,1);
kernel<<<grid, block>>>(...);

[Diagram: Grid 0 containing blocks (0,0) through (2,1)]

Page 33: GPU Parallel Computing

© NVIDIA Corporation 2013

IDs and Dimensions

Threads: 3D IDs, unique within a block

Blocks: 3D IDs, unique within a grid

Built-in variables:
- threadIdx: thread index within the block
- blockIdx: block index within the grid
- blockDim: block dimensions
- gridDim: grid dimensions

[Diagram: Device Grid 1 as a 3x2 arrangement of blocks (0,0) through (2,1); Block (1,1) expanded into a 5x3 arrangement of threads (0,0) through (4,2)]
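To make these built-in variables concrete, here is a small sketch (not from the original deck) that launches a 2x2 grid of 4-thread blocks and has each thread print its own coordinates; it relies on device-side printf, which requires compute capability 2.0 or later:

#include <stdio.h>

__global__ void whoami()
{
    // Each thread prints its block and thread coordinates
    printf("block (%d,%d,%d)  thread (%d,%d,%d)\n",
           blockIdx.x, blockIdx.y, blockIdx.z,
           threadIdx.x, threadIdx.y, threadIdx.z);
}

int main()
{
    dim3 grid(2, 2, 1), block(4, 1, 1);
    whoami<<<grid, block>>>();
    cudaDeviceSynchronize();   // wait for the kernel and flush device printf
    return 0;
}

The output order is not deterministic, since blocks and threads are scheduled independently.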

Page 34: GPU Parallel Computing

© NVIDIA Corporation 2013

GPU and Programming Model

Software        GPU
Thread          CUDA Core        Threads are executed by scalar processors (CUDA cores)
Thread Block    Multiprocessor   Thread blocks are executed on multiprocessors
Grid            Device           A kernel is launched as a grid of thread blocks

Page 35: GPU Parallel Computing

© NVIDIA Corporation 2013

Which thread do I belong to?

blockDim.x = 4, gridDim.x = 4

threadIdx.x:                                 0 1 2 3 | 0 1 2 3 | 0 1  2  3 |  0  1  2  3
blockIdx.x:                                  0 0 0 0 | 1 1 1 1 | 2 2  2  2 |  3  3  3  3
idx = blockIdx.x*blockDim.x + threadIdx.x:   0 1 2 3 | 4 5 6 7 | 8 9 10 11 | 12 13 14 15
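As a quick check of this mapping, a small sketch (not in the original slides) that launches 4 blocks of 4 threads and has each thread store its global index; after the copy back, h_out[i] == i for i = 0..15:

#include <stdio.h>

__global__ void write_index(int *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    out[idx] = idx;
}

int main()
{
    const int n = 16;
    int h_out[n];
    int *d_out = 0;
    cudaMalloc((void**)&d_out, n * sizeof(int));

    write_index<<<4, 4>>>(d_out);                    // gridDim.x = 4, blockDim.x = 4

    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++)
        printf("%d ", h_out[i]);                     // prints 0 1 2 ... 15
    printf("\n");

    cudaFree(d_out);
    return 0;
}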

Page 36: GPU Parallel Computing

© NVIDIA Corporation 2013

Code Walkthrough 2: Simple Kernel

1. Allocate memory on the GPU
2. Copy the data from CPU to GPU
3. Write a kernel to perform a vector addition
4. Copy the result to the CPU
5. Free the memory

Page 37: GPU Parallel Computing

© NVIDIA Corporation 2013

void vec_add(float *x, float *y, int n)
{
    for (int i=0; i<n; ++i)
        y[i] = x[i] + y[i];
}

float *x = (float*)malloc(n*sizeof(float));
float *y = (float*)malloc(n*sizeof(float));
vec_add(x, y, n);
free(x);
free(y);

Vector Addition using C

Page 38: GPU Parallel Computing

© NVIDIA Corporation 2013

__global__ void vec_add(float *x, float *y, int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    y[i] = x[i] + y[i];
}

float *d_x, *d_y;
cudaMalloc(&d_x, n*sizeof(float));
cudaMalloc(&d_y, n*sizeof(float));
cudaMemcpy(d_x, x, n*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, n*sizeof(float), cudaMemcpyHostToDevice);
vec_add<<<n/128,128>>>(d_x, d_y, n);
cudaMemcpy(y, d_y, n*sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_x);
cudaFree(d_y);

Vector Addition using CUDA C

Page 39: GPU Parallel Computing

© NVIDIA Corporation 2013

__global__ void vec_add(float *x, float *y, int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    y[i] = x[i] + y[i];
}

float *d_x, *d_y;
cudaMalloc(&d_x, n*sizeof(float));
cudaMalloc(&d_y, n*sizeof(float));
cudaMemcpy(d_x, x, n*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, n*sizeof(float), cudaMemcpyHostToDevice);
vec_add<<<n/128,128>>>(d_x, d_y, n);
cudaMemcpy(y, d_y, n*sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_x);
cudaFree(d_y);

Vector Addition using CUDA C

Keyword for CUDA kernel

Page 40: GPU Parallel Computing

© NVIDIA Corporation 2013

__global__ void vec_add(float *x, float *y, int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    y[i] = x[i] + y[i];
}

float *d_x, *d_y;
cudaMalloc(&d_x, n*sizeof(float));
cudaMalloc(&d_y, n*sizeof(float));
cudaMemcpy(d_x, x, n*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, n*sizeof(float), cudaMemcpyHostToDevice);
vec_add<<<n/128,128>>>(d_x, d_y, n);
cudaMemcpy(y, d_y, n*sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_x);
cudaFree(d_y);

Vector Addition using CUDA C

Thread index computation to replace the loop
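The launch above assumes n is an exact multiple of 128 so that every thread has a valid element. A minimal sketch of the usual guarded variant (not in the original slides), which rounds the block count up and bounds-checks inside the kernel:

__global__ void vec_add(float *x, float *y, int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)                     // guard: the last block may have extra threads
        y[i] = x[i] + y[i];
}

// Round the number of blocks up so every element is covered
int threads = 128;
int blocks  = (n + threads - 1) / threads;
vec_add<<<blocks, threads>>>(d_x, d_y, n);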

Page 41: GPU Parallel Computing

© NVIDIA Corporation 2013

GPU Memory Model Review

Thread: per-thread local memory

Block: per-block shared memory

Kernel 0, Kernel 1, ... (sequential kernels): per-device global memory

Page 42: GPU Parallel Computing

© NVIDIA Corporation 2013

Global Memory

Per-device global memory, shared by sequential kernels (Kernel 0, Kernel 1, ...)

Data lifetime = from allocation to deallocation
Accessible by all threads as well as the host (CPU)

Page 43: GPU Parallel Computing

© NVIDIA Corporation 2013

Shared Memory

C/C++: __shared__ int a[SIZE];

Allocated per threadblock

Data lifetime = block lifetime

Accessible by any thread in the threadblock
Not accessible to other threadblocks

Page 44: GPU Parallel Computing

© NVIDIA Corporation 2013

Registers

Per-thread local storage

Automatic variables (scalar/array) inside kernels
Data lifetime = thread lifetime
Accessible only by the thread that declares it

Page 45: GPU Parallel Computing

© NVIDIA Corporation 2013

Example of Using Shared Memory

Applying a 1D stencil to a 1D array of elements:
- Each output element is the sum of all input elements within a radius

For example, for radius = 3, each output element is the sum of 7 input elements:

[Diagram: an output element and the 7 input elements it sums, with radius elements on each side]

Page 46: GPU Parallel Computing

© NVIDIA Corporation 2013

Example of Using Shared Memory

[Diagram: input 1 2 3 4 5 6 7 8 ...; for radius = 3 the first output element is 1+2+3+4+5+6+7 = 28]

Page 47: GPU Parallel Computing

© NVIDIA Corporation 2013

__global__ void stencil(int* in, int* out)
{
    int globIdx = blockIdx.x * blockDim.x + threadIdx.x;
    int value = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        value += in[globIdx + offset];
    out[globIdx] = value;
}

Kernel Code Using Global Memory

One element per thread

A lot of redundant reads in neighboring threads: not an optimized approach

Page 48: GPU Parallel Computing

© NVIDIA Corporation 2013

Implementation with Shared Memory

One element per thread
Read (BLOCK_SIZE + 2 * RADIUS) input elements from global memory into shared memory
Compute BLOCK_SIZE output elements in shared memory
Write BLOCK_SIZE output elements to global memory

[Diagram: the BLOCK_SIZE input elements corresponding to the output elements, plus a "halo" of RADIUS elements on the left and RADIUS elements on the right]

Page 49: GPU Parallel Computing

© NVIDIA Corporation 2013

__global__ void stencil(int* in, int* out)
{
    __shared__ int shared[BLOCK_SIZE + 2 * RADIUS];
    int globIdx = blockIdx.x * blockDim.x + threadIdx.x;
    int locIdx = threadIdx.x + RADIUS;

    shared[locIdx] = in[globIdx];
    if (threadIdx.x < RADIUS) {
        // Load the left and right halo elements
        shared[locIdx - RADIUS] = in[globIdx - RADIUS];
        shared[locIdx + BLOCK_SIZE] = in[globIdx + BLOCK_SIZE];
    }
    __syncthreads();

    int value = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        value += shared[locIdx + offset];
    out[globIdx] = value;
}

Kernel Code
RADIUS = 3
BLOCK_SIZE = 16

Page 50: GPU Parallel Computing

© NVIDIA Corporation 2013

Thread Synchronization Function

void __syncthreads();
- Synchronizes all threads in a thread block
- Threads are scheduled at run time; once all threads have reached this point, execution resumes normally
- Used to avoid RAW / WAR / WAW hazards when accessing shared memory

Should be used in conditional code only if the conditional is uniform across the entire thread block; otherwise it may lead to deadlock.
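As an illustration (not from the original deck), a minimal sketch of the safe pattern: the barrier is placed in code that every thread of the block executes, and divergent work stays outside it:

#define BLOCK_SIZE 64

__global__ void reverse_block(int *data)
{
    __shared__ int tile[BLOCK_SIZE];
    int t   = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + t;

    tile[t] = data[idx];                    // every thread writes one element
    __syncthreads();                        // safe: reached by ALL threads of the block

    data[idx] = tile[BLOCK_SIZE - 1 - t];   // read an element written by another thread

    // NOT safe (may deadlock): a barrier inside a divergent branch, e.g.
    // if (t < 32) { ...; __syncthreads(); }
}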

Page 51: GPU Parallel Computing

© NVIDIA Corporation 2013

Kepler/Fermi Memory Hierarchy

[Diagram: each SM (SM-0 … SM-N) has its own Registers, L1 & shared memory, texture cache (TEX), and constant cache (C); all SMs share the L2 cache, which sits in front of Global Memory]

Page 52: GPU Parallel Computing

© NVIDIA Corporation 2013

Constant Cache

Global variables marked with __constant__ are constant and can't be changed from device code
- Cached by the constant cache
- Located in global memory
- Good when all threads access the same address

__constant__ int a = 10;

__global__ void kernel()
{
    a++; // error: device code cannot write to __constant__ memory
}

[Diagram: threads in a warp all reading the same memory address]
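Although device code cannot write to it, the host can still set a __constant__ variable; a minimal sketch (not in the original slides) using cudaMemcpyToSymbol:

__constant__ int a;                 // resides in constant memory, read-only for kernels

__global__ void kernel(int *out)
{
    out[threadIdx.x] = a;           // every thread reads the same address: ideal for the constant cache
}

int main()
{
    int value = 10;
    cudaMemcpyToSymbol(a, &value, sizeof(int));   // host-side update of the __constant__ variable
    // ... allocate "out", launch the kernel (it now sees a == 10), copy back ...
    return 0;
}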

Page 53: GPU Parallel Computing

© NVIDIA Corporation 2013

Texture Cache

Save data as a texture:
- Provides hardware accelerated filtered sampling of data (1D, 2D, 3D)
- Read-only data cache holds fetched samples
- Backed up by the L2 cache

Why use it?
- Separate pipeline from shared/L1
- Highest miss bandwidth
- Flexible, e.g. unaligned accesses

[Diagram: SMX with texture units (Tex) and the read-only data cache, backed by L2]

Page 54: GPU Parallel Computing

© NVIDIA Corporation 2013

Texture Cache Unlocked In GK110

Added a new path for compute:
- Avoids the texture unit
- Allows a global address to be fetched and cached
- Eliminates texture setup

Managed automatically by the compiler
- "const __restrict" indicates eligibility

[Diagram: SMX with texture units (Tex) and the read-only data cache, backed by L2]

Page 55: GPU Parallel Computing

© NVIDIA Corporation 2013

const __restrict

Annotate eligible kernel parameters with const __restrict

Compiler will automatically map loads to use the read-only data cache path

__global__ void saxpy(float x, float y, const float * __restrict input, float * output)
{
    size_t offset = threadIdx.x + (blockIdx.x * blockDim.x);

    // Compiler will automatically use texture for "input"
    output[offset] = (input[offset] * x) + y;
}
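For completeness, a minimal host-side launch sketch for this kernel (not in the original slides; d_in and d_out are assumed to be device buffers prepared as in the earlier examples, with n a multiple of 256):

int threads = 256;
int blocks  = n / threads;                        // assumes n % 256 == 0, as in the vec_add example
saxpy<<<blocks, threads>>>(2.0f, 1.0f, d_in, d_out);
cudaDeviceSynchronize();                          // wait for the kernel to finish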

Page 56: GPU Parallel Computing

© NVIDIA Corporation 2013

References

Manuals
- Programming Guide
- Best Practices Guide

Books
- CUDA by Example, Tsinghua University Press

Training videos
- GTC talks online: optimization, advanced optimization + hundreds of other GPU computing talks
  http://www.gputechconf.com/gtcnew/on-demand-gtc.php
- NVIDIA GPU Computing webinars
  http://developer.nvidia.com/gpu-computing-webinars

Forum
- http://cudazone.nvidia.cn/forum/forum.php