Upload
medennemri
View
230
Download
0
Embed Size (px)
Citation preview
8/7/2019 Luebke Cuda Fundamentals
1/85
Beyond Programmable Shading: In Action
CUDA FundamentalsCUDA Fundamentals
David Luebke, NVIDIA ResearchDavid Luebke, NVIDIA Research
8/7/2019 Luebke Cuda Fundamentals
2/85
Beyond Programmable Shading: In Action
Computers no longer get faster, just wider
Many people have not yet gotten this memo
You must re-think your algorithms to be parallel ! Not just a good idea the only way to gain performance
Otherwise: if its not fast enough now, it never will be
Data-parallel computing is most scalable solution
Otherwise: refactor code for 2 cores
You will always have more data than cores build the computation around the data
8 cores4 cores
TheThe NewNew MooreMoores Laws Law
16 cores
8/7/2019 Luebke Cuda Fundamentals
3/85
Beyond Programmable Shading: In Action
Enter the GPUEnter the GPU
Massive economies of scale
Massively parallel
8/7/2019 Luebke Cuda Fundamentals
4/85
Beyond Programmable Shading: In Action
Enter CUDAEnter CUDA
CUDA is a scalable parallel programming model and a
software environment for parallel computing Minimal extensions to familiar C/C++ environment
Heterogeneous serial-parallel programming model
NVIDIAs GPU architecture accelerates CUDA Expose the computational horsepower of NVIDIA GPUs
Enable general-purpose GPU computing
(CUDA also maps well to multicore CPUs)
8/7/2019 Luebke Cuda Fundamentals
5/85
Beyond Programmable Shading: In Action
The DemocratizationThe Democratization
of Parallel Computingof Parallel Computing GPU Computing with CUDA brings parallel
computing to the masses
Over 85,000,000 CUDA-capable GPUs sold A developer kit costs ~$100 (for 500 GFLOPS!)
Data-parallel supercomputers are everywhere CUDA makes this power accessible
Were already seeing innovations in data-parallel
computing
Massively parallel computing has become a
commodity technology
8/7/2019 Luebke Cuda Fundamentals
6/85
Beyond Programmable Shading: In Action
CUDA Programming ModelCUDA Programming Model
8/7/2019 Luebke Cuda Fundamentals
7/85
Beyond Programmable Shading: In Action
Some Design GoalsSome Design Goals
Scale to 100s of cores, 1000s of parallel threads
Let programmers focus on parallel algorithms
not mechanics of a parallel programming language
Enable heterogeneous systems (i.e., CPU+GPU) CPU good at serial computation, GPU at parallel
8/7/2019 Luebke Cuda Fundamentals
8/85
Beyond Programmable Shading: In Action
Key Parallel Abstractions in CUDAKey Parallel Abstractions in CUDA
0. Zillions of lightweight threads Simple decomposition model
1. Hierarchy of concurrent threads Simple execution model
2. Lightweight synchronization primitives Simple synchronization model
3. Shared memory model for cooperating threads Simple communication model
8/7/2019 Luebke Cuda Fundamentals
9/85
Beyond Programmable Shading: In Action
Hierarchy of concurrent threadsHierarchy of concurrent threads
Parallel kernels composed of many threads all threads execute the same sequential program
Threads are grouped into thread blocks threads in the same block can cooperate
Threads/blockshave unique IDs
Thread t
t0 t1 tBBlock b
Kernel foo()
. . .
8/7/2019 Luebke Cuda Fundamentals
10/85
Beyond Programmable Shading: In Action
Heterogeneous ProgrammingHeterogeneous Programming
CUDA = serial program with parallel kernels, all in C Serial C code executes in a CPU thread
Parallel kernel C code executes in thread blocks
across multiple processing elements
Serial Code
. . .
. . .
Parallel Kernel
foo>(args);
Serial Code
Parallel Kernel
bar>(args);
8/7/2019 Luebke Cuda Fundamentals
11/85
Beyond Programmable Shading: In Action
Example: Vector Addition KernelExample: Vector Addition Kernel
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
int i = threadIdx.x +blockDim.x *blockIdx.x;
C[i] = A[i] + B[i];
}
int main()
{
// Run N/256 blocks of 256 threads each
vecAdd>(d_A, d_B, d_C);
}
Device Code
8/7/2019 Luebke Cuda Fundamentals
12/85
Beyond Programmable Shading: In Action
Example: Vector Addition KernelExample: Vector Addition Kernel
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
int i = threadIdx.x +blockDim.x *blockIdx.x;
C[i] = A[i] + B[i];
}
int main()
{
// Run N/256 blocks of 256 threads each
vecAdd>(d_A, d_B, d_C);
}
Host Code
8/7/2019 Luebke Cuda Fundamentals
13/85
Beyond Programmable Shading: In Action
Synchronization & CoordinationSynchronization & Coordination
Threads within block may synchronize withbarriers
Step 1 __syncthreads();
Step 2
Blocks coordinate via atomic memory operations e.g., increment shared queue pointer with atomicInc()
Implicit barrier between dependent kernelsvec_minus(a, b, c);
vec_dot(c, c);
8/7/2019 Luebke Cuda Fundamentals
14/85
Beyond Programmable Shading: In Action
Threads & Thread BlocksThreads & Thread Blocks
What is a thread?
An independent thread of execution
Has its own PC, variables (registers), processor state, etc.
What is a thread block?
A virtualized multiprocessor Fit processors to data & computation, each kernel launch
A (data) parallel task all blocks in kernel have the same entry point
but may execute any code they want
8/7/2019 Luebke Cuda Fundamentals
15/85
Beyond Programmable Shading: In Action
Blocks Must Be IndependentBlocks Must Be Independent
Any possible interleaving of blocks should be valid
presumed to run to completion without pre-emption
can run in any order can run concurrently OR sequentially
Blocks may coordinate but not synchronize
shared queue pointer: OK
shared lock: BAD can easily deadlock
Independence requirement gives scalability
8/7/2019 Luebke Cuda Fundamentals
16/85
Beyond Programmable Shading: In Action
Blocks Run on MultiprocessorsBlocks Run on Multiprocessors
Kernel launched by host
. ..
SP
SharedMemory
MT IU
SP
SharedMemory
MT IU
SP
SharedMemory
MT IU
SP
SharedMemory
MT IU
SP
SharedMemory
MT IU
SP
SharedMemory
MT IU
SP
SharedMemory
MT IU
SP
SharedMemory
MT IU
. . .
Device processor array
Device Memory
8/7/2019 Luebke Cuda Fundamentals
17/85
Beyond Programmable Shading: In Action
Levels of ParallelismLevels of Parallelism
Thread parallelism each thread is an independent thread of execution
Data parallelism across threads in a block
across blocks in a kernel
Task parallelism different blocks are independent
independent kernels
8/7/2019 Luebke Cuda Fundamentals
18/85
Beyond Programmable Shading: In Action
Memory modelMemory model
Thread
Per-thread
Local Memory
Block
Per-block
SharedMemory
8/7/2019 Luebke Cuda Fundamentals
19/85
Beyond Programmable Shading: In Action
Memory modelMemory model
Kernel 0
. . .Per-device
GlobalMemory
Kernel 1
SequentialKernels
. . .
8/7/2019 Luebke Cuda Fundamentals
20/85
Beyond Programmable Shading: In Action
Memory modelMemory model
Device 0memory
Device 1memory
Host
memory
cudaMemcpy()
8/7/2019 Luebke Cuda Fundamentals
21/85
Beyond Programmable Shading: In Action
CUDA SDKCUDA SDK
NVIDIA C Compiler
NVIDIA Assemblyfor Computing
CPU Host Code
Integrated CPUand GPU C Source Code
Libraries:FFT, BLAS,Example Source Code
CUDA
Driver
Debugger
Profiler Standard C Compiler
GPU CPU
8/7/2019 Luebke Cuda Fundamentals
22/85
Beyond Programmable Shading: In Action
Graphics InteroperabilityGraphics Interoperability
Go Beyond Programmable Shading
Access textures from within CUDA Free filtering, wrapping, 2/3D addressing, spatial caching...
Share resources (vertex, pixel, texture, surface) Map resources into CUDA global memory, manipulate freely
Geometry: physics (collision, response, deformation, ...),
lighting, geometry creation, non-traditional rasterization,acceleration structures, AI, ...
Imaging: image processing, (de)compression, procedural texturegeneration, postprocess effects, computer vision, ...
8/7/2019 Luebke Cuda Fundamentals
23/85
Beyond Programmable Shading: In Action
SummarySummary
CUDA is a powerful parallel programming model
Heterogeneous - mixed serial-parallel programming Scalable - hierarchical thread execution model
Accessible - minimal but expressive changes to C
Interoperable - simple graphics interop mechanisms
CUDA is an attractive platform Broad - OpenGL, DirectX, WinXP, Vista, Linux, MacOS
Widespread - over 85M CUDA GPUs, 60K CUDA developers
CUDA provides t remendous scope for innovat ivegraphics research beyond programmable shading
8/7/2019 Luebke Cuda Fundamentals
24/85
Beyond Programmable Shading: In Action
Questions?Questions?
David LuebkeDavid [email protected]@nvidia.com
8/7/2019 Luebke Cuda Fundamentals
25/85
Beyond Programmable Shading: In Action
CUDACUDA
Design ChoicesDesign Choices
8/7/2019 Luebke Cuda Fundamentals
26/85
Beyond Programmable Shading: In Action
Goal: ScalabilityGoal: Scalability
Scalable execution Program must be insensitive to the number of cores
Write one program for any number of SM cores
Program runs on any size GPU without recompiling
Hierarchical execution model
Decompose problem into sequential steps (kernels) Decompose kernel into computing parallel blocks
Decompose block into computing parallel threads
Hardware distributes independentblocks to SMsas available
8/7/2019 Luebke Cuda Fundamentals
27/85
Beyond Programmable Shading: In Action
Blocks Run on MultiprocessorsBlocks Run on Multiprocessors
Kernel launched by host
. ..
SP
SharedMemory
MT IU
SP
SharedMemory
MT IU
SP
SharedMemory
MT IU
SP
SharedMemory
MT IU
SP
SharedMemory
MT IU
SP
SharedMemory
MT IU
SP
SharedMemory
MT IU
SP
SharedMemory
MT IU
. . .
Device processor array
Device Memory
8/7/2019 Luebke Cuda Fundamentals
28/85
Beyond Programmable Shading: In Action
Goal: easy to programGoal: easy to program
Strategies: Familiar programming language mechanics
C/C++ with small extensions
Simple parallel abstractions
Simple barrier synchronization Shared memory semantics
Hardware-managed hierarchy of threads
8/7/2019 Luebke Cuda Fundamentals
29/85
Beyond Programmable Shading: In Action
Hardware MultithreadingHardware Multithreading
Hardware allocates resources to blocks
blocks need: thread slots, registers, shared memory
blocks dont run until resources are available
Hardware schedules threads
threads have their own registers any thread not waiting for something can run
context switching is (basically) free every cycle
Hardware relies on threads to hide latency i.e., parallelism is necessary for performance
8/7/2019 Luebke Cuda Fundamentals
30/85
Beyond Programmable Shading: In Action
GraphicsGraphics InteropInterop
DetailsDetails
8/7/2019 Luebke Cuda Fundamentals
31/85
Beyond Programmable Shading: In Action
OpenGLOpenGL InteropInterop StepsSteps
Register a buffer object with CUDA cudaGLRegisterBufferObject(GLuint buffObj);
OpenGL can use a registered buffer only as a source Unregister the buffer prior to rendering to it by OpenGL
Map the buffer object to CUDA memory cudaGLMapBufferObject(void **devPtr, GLuint buffObj);
Returns an address in global memory Buffer must registered prior to mapping
Launch a CUDA kernel to process the buffer
Unmap the buffer object prior to use by OpenGL cudaGLUnmapBufferObject(GLuint buffObj);
Unregister the buffer object cudaGLUnregisterBufferObject(GLuint buffObj);
Optional: needed if the buffer is a render target
Use the buffer object in OpenGL code
InteropInterop ScenarioScenario:
8/7/2019 Luebke Cuda Fundamentals
32/85
Beyond Programmable Shading: In Action
InteropInterop Scenario:Scenario:Dynamic CUDADynamic CUDA--generated texturegenerated texture
Register the texture PBO with CUDA
For each frame:
Map the buffer
Generate the texture in a CUDA kernel
Unmap the buffer
Update the texture
Render the textured object
unsigned char *p_d=0;
cudaGLMapBufferObject((void**)&p_d, pbo);
prepTexture(p_d, time);
cudaGLUnmapBufferObject(pbo);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
glBindTexture(GL_TEXTURE_2D, texID);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0,0, 256,256,
GL_BGRA, GL_UNSIGNED_BYTE, 0);
I tI t S iS i
8/7/2019 Luebke Cuda Fundamentals
33/85
Beyond Programmable Shading: In Action
InteropInterop Scenario:Scenario:Frame PostFrame Post--processing by CUDAprocessing by CUDA
For each frame:
Render to PBO with OpenGL Register the PBO with CUDA
Map the buffer
Process the buffer with a CUDA kernel
Unmap the buffer Unregister the PBO from CUDA
unsigned char *p_d=0;
cudaGLRegisterBufferObject(pbo);
cudaGLMapBufferObject((void**)&p_d, pbo);postProcess(p_d);
cudaGLUnmapBufferObject(pbo);
cudaGLUnregisterBufferObject(pbo);
...
8/7/2019 Luebke Cuda Fundamentals
34/85
Beyond Programmable Shading: In Action
Texture Fetch in CUDATexture Fetch in CUDA
Advantages of using texture: Texture fetches are cached
optimized for 2D locality (swizzled)
Textures are addressable in 2D using integer or normalized coordinates means fewer addressing calculations in code
Get filtering for free bilinear interpolation, anisotropic filtering etc.
Free wrap modes (boundary conditions) clamp to edge
repeat
Disadvantages: Textures are read-only May not be a big performance advantage if global reads are already
coalesced
8/7/2019 Luebke Cuda Fundamentals
35/85
Beyond Programmable Shading: In Action
Texture Fetch in CUDATexture Fetch in CUDA
Textures represent 1D, 2D or 3D arrays of memory
CUDA texture type defines type of components anddimension:
texture
texfetch function performs texture fetch:
texfetch(texture t,
float x, float y=0, float z=0)
8/7/2019 Luebke Cuda Fundamentals
36/85
Beyond Programmable Shading: In Action
Channel DescriptorsChannel Descriptors
Channel descriptor is used in conjunction withtexture and describes how reads from globalmemory are interpreted
Predefined channel descriptors exist for built-intypes, e.g.:
cudaChannelFormatDesc
cudaCreateChannelDesc(void);
cudaChannelFormatDesc
cudaCreateChannelDesc(void);
cudaChannelFormatDesc
cudaCreateChannelDesc(void);
8/7/2019 Luebke Cuda Fundamentals
37/85
Beyond Programmable Shading: In Action
ArraysArrays
Arrays are the host representation of n-dimensional textures
To allocate an arraycudaError_t
cudaMalloc( cudaArray_t **arrayPtr,struct cudaChannelFormatDesc &desc,
size_t width,
size_t height = 1);
To free an array:cudaError_t
cudaFree( struct cudaArray *array);
8/7/2019 Luebke Cuda Fundamentals
38/85
Beyond Programmable Shading: In Action
Loading Memory from ArraysLoading Memory from Arrays
Load memory to arrays using a version ofcudaMemcpy:
cudaError_t cudaMemcpy(
void *dst,
const struct cudaArray *src,
size_t count,
enum cudaMemcpyKind kind
)
8/7/2019 Luebke Cuda Fundamentals
39/85
Beyond Programmable Shading: In Action
Binding TexturesBinding Textures
Textures are bound to arrays using:
templatecudaError_t cudaBindTexture(
const struct texture &tex,
const struct cudaArray *array
)
To unbind:
template
cudaError_t cudaUnbindTexture(const struct texture &tex
)
8/7/2019 Luebke Cuda Fundamentals
40/85
Beyond Programmable Shading: In Action
ExampleExamplecudaArray* cu_array;
texture tex;
// Allocate array
cudaMalloc( &cu_array, cudaCreateChannelDesc(),
width, height );
// Copy image data to array
cudaMemcpy( cu_array, image, width*height, cudaMemcpyHostToDevice);
// Bind the array to the texture
cudaBindTexture( tex, cu_array);
// Run kernel
dim3 blockDim(16, 16, 1);
dim3 gridDim(width / blockDim.x, height / blockDim.y, 1);
kernel>(d_odata, width, height);
cudaUnbindTexture(tex);
8/7/2019 Luebke Cuda Fundamentals
41/85
Beyond Programmable Shading: In Action
Example (cont)Example (cont)__global__ void
kernel(float* odata, int height, int width)
{
unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;float c = texfetch(tex, x, y);
odata[y*width+x] = c;
}
8/7/2019 Luebke Cuda Fundamentals
42/85
Beyond Programmable Shading: In Action
CUDACUDA
ProgrammingProgramming
8/7/2019 Luebke Cuda Fundamentals
43/85
Beyond Programmable Shading: In Action
GPU Computing with CUDAGPU Computing with CUDA
CUDA = Compute Unified Driver Architecture
Co-designed hardware & software for direct GPU
computing
Hardware: fully general data-parallelarchitecture
Software: program the GPU in C
General thread launch
Global load-store
Parallel data cache
Scalar architecture
Integers, bit operations
Double precision (shortly)
Scalable data-parallelexecution/memory model
C with minimal yetpowerful extensions
C DA SD
8/7/2019 Luebke Cuda Fundamentals
44/85
Beyond Programmable Shading: In Action
CUDA SDKCUDA SDK
NVIDIA C Compiler
NVIDIA Assemblyfor Computing
CPU Host Code
Integrated CPUand GPU C Source Code
Libraries:FFT, BLAS,Example Source Code
CUDA
Driver
Debugger
ProfilerStandard C Compiler
GPU CPU
CUDA F il bl k lCUDA F il bl k l
8/7/2019 Luebke Cuda Fundamentals
45/85
Beyond Programmable Shading: In Action
CUDA: Features available to kernelsCUDA: Features available to kernels
Standard mathematical functionssinf, powf, atanf, ceil, etc.
Built-in vector typesfloat4, int4, uint4, etc. for dimensions 1..4
Texture accesses in kernelstexture my_texture; // declare texture reference
float4 texel = texfetch(my_texture, u, v);
G8 CUDA C i h E iG8 CUDA C ith E t i
8/7/2019 Luebke Cuda Fundamentals
46/85
Beyond Programmable Shading: In Action
G8x CUDA = C with ExtensionsG8x CUDA = C with Extensions
Philosophy: provide minimal set of extensions necessary to expose power
Function qualifiers:__global__ void MyKernel() { }
__device__ float MyDeviceFunc() { }
Variable qualifiers:__constant__ float MyConstantArray[32];
__shared__ float MySharedArray[32];
Execution configuration:dim3 dimGrid(100, 50); // 5000 thread blocks
dim3 dimBlock(4, 8, 8); // 256 threads per block
MyKernel > (...); // Launch kernel
Built-in variables and functions valid in device code:dim3 gridDim; // Grid dimensiondim3 blockDim; // Block dimension
dim3 blockIdx; // Block index
dim3 threadIdx; // Thread index
void__syncthreads(); // Thread synchronization
CUDA R ti tCUDA R ti t
8/7/2019 Luebke Cuda Fundamentals
47/85
Beyond Programmable Shading: In Action
CUDA: Runtime supportCUDA: Runtime support
Explicit memory allocation returns pointers to GPU memory
cudaMalloc(), cudaFree()
Explicit memory copy for host
device, device
devicecudaMemcpy(), cudaMemcpy2D(), ...
Texture management
cudaBindTexture(), cudaBindTextureToArray(), ...
OpenGL & DirectX interoperability
cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(),
E l V t Additi K lE l V t Additi K l
8/7/2019 Luebke Cuda Fundamentals
48/85
Beyond Programmable Shading: In Action
Example: Vector Addition KernelExample: Vector Addition Kernel
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
int i = threadIdx.x +blockDim.x *blockIdx.x;
C[i] = A[i] + B[i];
}
E l I ki g th K lExample: Invoking the Kernel
8/7/2019 Luebke Cuda Fundamentals
49/85
Beyond Programmable Shading: In Action
Example: Invoking the KernelExample: Invoking the Kernel
__global__ void vecAdd(float* A, float* B, float* C);
void main()
{
// Execute on N/256 blocks of 256 threads each
vecAdd>(d_A, d_B, d_C);
}
Example: Host code for memoryExample: Host code for memory
8/7/2019 Luebke Cuda Fundamentals
50/85
Beyond Programmable Shading: In Action
Example: Host code for memoryExample: Host code for memory
// allocate host (CPU) memory
float* h_A = (float*) malloc(N * sizeof(float));
float* h_B = (float*) malloc(N * sizeof(float));
initalize h_A and h_B
// allocate device (GPU) memory
float* d_A, d_B, d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));
// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float),cudaMemcpyHostToDevice));
cudaMemcpy( d_B, h_B, N * sizeof(float),cudaMemcpyHostToDevice));
// execute the kernel on N/256 blocks of 256 threads each
vecAdd(d_A, d_B, d_C);
A quick reviewA quick review
8/7/2019 Luebke Cuda Fundamentals
51/85
Beyond Programmable Shading: In Action
A quick reviewA quick review
device = GPU = set of multiprocessors
Multiprocessor = set of processors & shared memory
Kernel = GPU program Grid = array of thread blocks that execute a kernel
Thread block = group of SIMD threads that execute a
kernel and can communicate via shared memoryMemory Location Cached Access Who
Local Off-chip No Read/write One thread
Shared On-chip N/A - resident Read/write All threads in a block
Global Off-chip No Read/write All threads + host
Constant Off-chip Yes Read All threads + host
Texture Off-chip Yes Read All threads + host
8/7/2019 Luebke Cuda Fundamentals
52/85
Beyond Programmable Shading: In Action
Myths ofMyths of
GPU ComputingGPU Computing
Myths of GPU ComputingMyths of GPU Computing
8/7/2019 Luebke Cuda Fundamentals
53/85
Beyond Programmable Shading: In Action
Myths of GPU ComputingMyths of GPU Computing
GPUs layer normal programs on top of graphics NO: CUDA compiles directly to the hardware
GPUs architectures are: Very wide (1000s) SIMD machines
on which branching is impossible or prohibitive
with 4-wide vector registers.
GPUs are power-inefficient
GPUs dont do real floating point
Myths of GPU ComputingMyths of GPU Computing
8/7/2019 Luebke Cuda Fundamentals
54/85
Beyond Programmable Shading: In Action
Myths of GPU ComputingMyths of GPU Computing
GPUs layer normal programs on top of graphics
GPUs architectures are: Very wide (1000s) SIMD machines
on which branching is impossible or prohibitive
with 4-wide vector registers.
GPUs are power-inefficient
GPUs dont do real floating point
Myths of GPU ComputingMyths of GPU Computing
8/7/2019 Luebke Cuda Fundamentals
55/85
Beyond Programmable Shading: In Action
Myths of GPU ComputingMyths of GPU Computing
GPUs layer normal programs on top of graphics
GPUs architectures are: Very wide (1000s) SIMD machines NO: warps are 32-wide
on which branching is impossible or prohibitive
with 4-wide vector registers.
GPUs are power-inefficient
GPUs dont do real floating point
Myths of GPU ComputingMyths of GPU Computing
8/7/2019 Luebke Cuda Fundamentals
56/85
Beyond Programmable Shading: In Action
Myths of GPU ComputingMyths of GPU Computing
GPUs layer normal programs on top of graphics
GPUs architectures are: Very wide (1000s) SIMD machines
on which branching is impossible or prohibitive NOPE
with 4-wide vector registers.
GPUs are power-inefficient
GPUs dont do real floating point
Myths of GPU ComputingMyths of GPU Computing
8/7/2019 Luebke Cuda Fundamentals
57/85
Beyond Programmable Shading: In Action
Myths of GPU ComputingMyths of GPU Computing
GPUs layer normal programs on top of graphics
GPUs architectures are: Very wide (1000s) SIMD machines
on which branching is impossible or prohibitive
with 4-wide vector registers.
GPUs are power-inefficient
GPUs dont do real floating point
Myths of GPU ComputingMyths of GPU Computing
8/7/2019 Luebke Cuda Fundamentals
58/85
Beyond Programmable Shading: In Action
Myths of GPU ComputingMyths of GPU Computing
GPUs layer normal programs on top of graphics
GPUs architectures are: Very wide (1000s) SIMD machines
on which branching is impossible or prohibitive
with 4-wide vector registers. NO: scalar thread processors
GPUs are power-inefficient
GPUs dont do real floating point
Myths of GPU ComputingMyths of GPU Computing
8/7/2019 Luebke Cuda Fundamentals
59/85
Beyond Programmable Shading: In Action
Myths of GPU ComputingMyths of GPU Computing
GPUs layer normal programs on top of graphics
GPUs architectures are: Very wide (1000s) SIMD machines
on which branching is impossible or prohibitive
with 4-wide vector registers.
GPUs are power-inefficient
GPUs dont do real floating point
Myths of GPU ComputingMyths of GPU Computing
8/7/2019 Luebke Cuda Fundamentals
60/85
Beyond Programmable Shading: In Action
Myths of GPU ComputingMyths of GPU Computing
GPUs layer normal programs on top of graphics
GPUs architectures are: Very wide (1000s) SIMD machines
on which branching is impossible or prohibitive
with 4-wide vector registers.
GPUs are power-inefficient:No 4-10x perf/W advantage, up to 89x reported for some studies
GPUs dont do real floating point
Myths of GPU ComputingMyths of GPU Computing
8/7/2019 Luebke Cuda Fundamentals
61/85
Beyond Programmable Shading: In Action
Myths of GPU ComputingMyths of GPU Computing
GPUs layer normal programs on top of graphics
GPUs architectures are: Very wide (1000s) SIMD machines
on which branching is impossible or prohibitive
with 4-wide vector registers.
GPUs are power-inefficient:
GPUs dont do real floating point
Double Precision Floating Point
8/7/2019 Luebke Cuda Fundamentals
62/85
Beyond Programmable Shading: In Action
g
GPU Floating Point FeaturesGPU Floating Point Features
8/7/2019 Luebke Cuda Fundamentals
63/85
Beyond Programmable Shading: In Action
ggG80 SSE IBM Altivec Cell SPE
Precision IEEE 754 IEEE 754 IEEE 754 IEEE 754
Rounding modes forFADD and FMUL
Round to nearest andround to zero
All 4 IEEE, round tonearest, zero, inf, -inf
Round to nearestonly
Round tozero/truncate only
Denormal handling Flush to zero
Supported,
1000s of cycles
Supported,
1000s of cycles Flush to zero
NaN support Yes Yes Yes No
Overflow and Infinitysupport
Yes, only clamps tomax norm
Yes Yes No, infinity
Flags No Yes Yes Some
Square root Software only Hardware Software only Software only
Division Software only Hardware Software only Software only
Reciprocal estimate
accuracy 24 bit 12 bit 12 bit 12 bit
Reciprocal sqrtestimate accuracy
23 bit 12 bit 12 bit 12 bit
log2(x) and 2^xestimates accuracy
23 bit No 12 bit No
DoDo GPUsGPUs Do Real IEEE FP?Do Real IEEE FP?
8/7/2019 Luebke Cuda Fundamentals
64/85
Beyond Programmable Shading: In Action
G8x GPU FP is IEEE 754 Comparable to other processors / accelerators
More precise / usable in some ways Less precise in other ways
GPU FP getting better every generation Double precision support in GTX 20
Goal: best of class by 2009
8/7/2019 Luebke Cuda Fundamentals
65/85
Beyond Programmable Shading: In Action
ApplicationsApplications
&&
Sweet SpotsSweet Spots
8/7/2019 Luebke Cuda Fundamentals
66/85
Beyond Programmable Shading: In Action
GPU Computing:Motivation110-240X
45X100X
35X
17X
13457x
GPUsGPUs Are Fast & Getting FasterAre Fast & Getting Faster
8/7/2019 Luebke Cuda Fundamentals
67/85
Beyond Programmable Shading: In Action
gg
0
250
500
750
1000
Sep-02 Jan-04 May-05 Oct-06 Feb-08
PeakG
FLOP/s
NVIDIA GPU Intel CPU
GPU Computing Sweet SpotsGPU Computing Sweet Spots
8/7/2019 Luebke Cuda Fundamentals
68/85
Beyond Programmable Shading: In Action
Applications:
High arithmetic intensity:
Dense linear algebra, PDEs, n-body, finite difference,
High bandwidth:Sequencing (virus scanning, genomics), sorting, database
Visual computing:Graphics, image processing, tomography, machine vision
GPU Computing Example MarketsGPU Computing Example Markets
8/7/2019 Luebke Cuda Fundamentals
69/85
Beyond Programmable Shading: In Action
Computational
Modeling
Computational
Chemistry
Computational
Medicine
Computational
Science
Computational
Biology
Computational
Finance
Computational
Geoscience
Image
Processing
GPU Computing Sweet SpotsGPU Computing Sweet Spots
8/7/2019 Luebke Cuda Fundamentals
70/85
Beyond Programmable Shading: In Action
From cluster to workstation
The personal supercomputing phase change From lab to clinic
From machine room to engineer, grad student desks From batch processing to interactive
From interactive to real-time
GPU-enabled clusters
A 100x or better speedup changes the science
Solve at different scales Direct brute-force methods may outperform cleverness
New bottlenecks may emerge
Approaches once inconceivable may become practical
ApplicationsApplications -- CondensedCondensed
8/7/2019 Luebke Cuda Fundamentals
71/85
Beyond Programmable Shading: In Action
3D image analysis
Adaptive radiation therapy
Acoustics
Astronomy
Audio
Automobile vision Bioinfomatics
Biological simulation
Broadcast
Cellular automata
Computational Fluid Dynamics
Computer Vision
Cryptography
CT reconstruction
Data Mining
Digital cinema/projections
Electromagnetic simulation
Equity training
Film
Financial - lots of areas
Languages
GIS
Holographics cinema
Imaging (lots) Mathematics research
Military (lots)
Mine planning
Molecular dynamics
MRI reconstruction
Multispectral imaging
nbody
Network processing
Neural network
Oceanographic research
Optical inspection
Particle physics
Protein folding
Quantum chemistry
Ray tracing
Radar
Reservoir simulation
Robotic vision/AI
Robotic surgery
Satellite data analysis
Seismic imaging
Surgery simulation
Surveillance
Ultrasound
Video conferencing
Telescope
Video
Visualization
Wireless
X-ray
Tesla HPC PlatformTesla HPC Platform
8/7/2019 Luebke Cuda Fundamentals
72/85
Beyond Programmable Shading: In Action
HPC-oriented product line C870: board (1 GPU)
D870: deskside unit (2 GPUs) S870: 1u server unit (4 GPUs)
NVIDIA Decoder Ring
8/7/2019 Luebke Cuda Fundamentals
73/85
Beyond Programmable Shading: In Action
TeslaTMHigh Performance Computing
QuadroDesign & Creation
GeForceEntertainment
Architecture: TESLA
Chips: G80, G84, G92,
New ApplicationsNew Applications
8/7/2019 Luebke Cuda Fundamentals
74/85
Beyond Programmable Shading: In Action
Real-time options implied volatility engine
Swaption volatility cube calculator
Manifold 8 GIS
Ultrasound imaging
HOOMD Molecular Dynamics
Also
Image rotation/classification
Graphics processing toolbox
Microarray data analysis
Data parallel primitives
Astrophysics simulations
SDK: Mandelbrot, computer visi
Seismic migration
GPU Computing: Biomedical ImagingGPU Computing: Biomedical Imaging
8/7/2019 Luebke Cuda Fundamentals
75/85
Beyond Programmable Shading: In Action
Representative PublicationsRepresentative Publications
Computed Tomography:
K. Mueller, F. Xu, and N. Neophytou. Why do commodity graphicshardware boards (GPUs) work so well for acceleration of
computed tomography? SPIE Electronic Imaging 2007,Computational Imaging V Keynote, 2007.
G. C. Sharp, N. Kandasamy, H. Singh, and M. Folkert, GPU-basedstreaming architectures for fast cone-beam CT imagereconstruction and Demons deformable registration,Physics in Medicine and Biology, 2007.
H. Scherl, B. Keck, M. Kowarschik, and J. Hornegger. Fast GPU-Based CT Reconstruction using the Common Unified DeviceArchitecture (CUDA), 2007 Nuclear Science Symposium and
Medical Imaging Conference, Honolulu HW.
X. Xue, A. Cheryauka, and D. Tubbs. Acceleration of fluoro-CTreconstruction for a mobile C-Arm on GPU and FPGAhardware: A simulation study. SPIE Medical Imaging 2006.
K. Chidlow and T. Moller. Rapid emission tomographyreconstruction. Int'l Workshop on Volume Graphics, 2003.
K. Mueller and R. Yagel. Rapid 3-D cone-beam reconstruction withthe simultaneous algebraic reconstruction technique (SART)
using 2-D texture mapping hardware. IEEE Transactions onMedical Imaging, 19(12):1227-1237, 2000.
Magnetic Resonance Imaging:
S. Stone, J. Haldar, S. Tsao, W. Hwu, Z. Liang, B. Sutton.Accelerating Advanced MRI Reconstructions on GPUs, ACMFrontiers in Computing, Ischia, Italy, 2008.
W. Jeong, P. T. Fletcher, R. Tao, and R.T. Whitaker, InteractiveVisualization of Volumetric White Matter Connectivity inDT-MRI Using a Parallel-Hardware Hamilton-Jacobi Solver,Proc. IEEE Visualization 2007, 2007.
T. Schiwietz, T. Chang, P. Speier, and R. Westermann. MR imagereconstruction using the GPU. SPIE Medical Imaging 2006.
T. Sorensen, T. Schaeter, K. Noe, and M. Hansen. Accelerating thenon-equispaced fast Fourier transform on commoditygraphics hardware. IEEE Transactions on Medical Imaging.
K. Mueller and R. Yagel. Rapid 3-D cone-beam reconstruction withthe simultaneous algebraic reconstruction technique (SART)using 2-D texture mapping hardware. IEEE Transactions onMedical Imaging, 19(12):1227-1237, 2000.
Registration:
C. Vetter and C. Guetter and C. Xu and R. Westermann, Non-rigidmulti-modal registration on the GPU, Medical Imaging 2007:Image Processing, SPIE, vol. 6512, Mar 2007.
The Future ofThe Future of GPUsGPUs
8/7/2019 Luebke Cuda Fundamentals
76/85
Beyond Programmable Shading: In Action
GPU Computing drives new applications Reducing Time to Discovery
100x Speedup changes science and research methods
New applications drive the future of GPUs andGPU Computing Drives new GPU capabilities
Drives hunger for more performance
Some exciting new domains:
Vision, acoustic, and embedded applications Large-scale simulation & physics
8/7/2019 Luebke Cuda Fundamentals
77/85
Beyond Programmable Shading: In Action
Parallel ComputingParallel Computing
ParallelParallel ComputingComputingss Golden AgeGolden Age
8/7/2019 Luebke Cuda Fundamentals
78/85
Beyond Programmable Shading: In Action
1980s, early `90s: a golden age for parallel computing Particularly data-parallel computing
Architectures Connection Machine, MasPar, Cray
True supercomputers: incredibly exotic, powerful, expensive
Algorithms, languages, & programming models Solved a wide variety of problems
Various parallel algorithmic models developed P-RAM, V-RAM, circuit, hypercube, etc.
ParallelParallel ComputingComputingss Dark AgeDark Age
8/7/2019 Luebke Cuda Fundamentals
79/85
Beyond Programmable Shading: In Action
Butimpact of data-parallel computing limited
Thinking Machines sold 7 CM-1s (100s of systems total)
MasPar sold ~200 systems
Commercial and research activity subsided Massively-parallel machines replaced by clusters
of ever-more powerful commodity microprocessors
Beowulf, Legion, grid computing,
Massively parallel computing lost momentum tothe inexorable advance of commodity technology
8/7/2019 Luebke Cuda Fundamentals
80/85
Beyond Programmable Shading: In Action
GPU DesignGPU Design
CPU/GPU ParallelismCPU/GPU Parallelism
8/7/2019 Luebke Cuda Fundamentals
81/85
Beyond Programmable Shading: In Action
Moores Law gives you more and more transistors What do you want to do with them?
CPU strategy: make the workload (one compute thread) run as fast aspossible
Tactics:
Cache (area limiting)
Instruction/Data prefetch
Speculative execution
limited by perimeter communication bandwidth
then add task parallelismmulti-core
GPU strategy: make the workload (as many threads as possible) run as fastas possible
Tactics:
Parallelism (1000s of threads)
Pipelining limited by area compute capability
Hardware Implementation:Hardware Implementation:A Set of SIMD MultiprocessorsA Set of SIMD Multiprocessors
8/7/2019 Luebke Cuda Fundamentals
82/85
Beyond Programmable Shading: In Action
pp
The device is a set ofmultiprocessors
Each multiprocessor is aset of 32-bit processorswith a Single InstructionMultiple Data architecture
At each clock cycle, amultiprocessor executesthe same instruction on agroup of threads called a
warp The number of threads in a
warp is the warp size
Device
Multiprocessor N
Multiprocessor 2
Multiprocessor 1
InstructionUnit
Processor 1
Processor 2 Processor M
Streaming Multiprocessor (SM)Streaming Multiprocessor (SM)
8/7/2019 Luebke Cuda Fundamentals
83/85
Beyond Programmable Shading: In Action
Processing elements 8 scalar thread processors (SP)
32 GFLOPS peak at 1.35 GHz
8192 32-bit registers (32KB) MB total register file space!
usual ops: float, int, branch,
Hardware multithreading up to 8 blocks resident at once
up to 768 active threads in total
16KB on-chip memory low latency storage
shared amongst threads of a block
supports thread communication
SP
SharedMemory
MT IU
SM t0 t1 tB
Hardware Implementation:Hardware Implementation:
8/7/2019 Luebke Cuda Fundamentals
84/85
Beyond Programmable Shading: In Action
Memory ArchitectureMemory Architecture
The device has local
device memory Can be read and written by
the host and by themultiprocessors
Each multiprocessor has: A set of 32-bit registers per
processor
on-chip shared memory
A read-only constant cache A read-only texture cache
Device
Multiprocessor N
Multiprocessor 2
Multiprocessor 1
Device memory
Shared Memory
Instruction
UnitProcessor 1
Registers
Processor 2
Registers
Processor M
Registers
ConstantCache
Texture
Cache
Hardware Implementation:Hardware Implementation:
8/7/2019 Luebke Cuda Fundamentals
85/85
Beyond Programmable Shading: In Action
Memory ModelMemory Model
Each thread can:
Read/write per-blockon-chip shared memory
Read per-grid cachedconstant memory
Read/write non-cacheddevice memory:
Per-grid global memory
Per-thread local memory Read cached texture
memory
Grid
ConstantMemory
TextureMemory
GlobalMemory
Block (0, 0)
Shared Memory
LocalMemory
Thread (0, 0)
Registers
LocalMemory
Thread (1, 0)
Registers
Block (1, 0)
Shared Memory
LocalMemory
Thread (0, 0)
Registers
LocalMemory
Thread (1, 0)
Registers