GPU Architecture and Programming
GPU vs CPU: https://www.youtube.com/watch?v=fKK933KK6Gg
GPU Architecture
• GPUs (Graphics Processing Units) were originally designed as graphics accelerators, used for real-time graphics rendering.
• Starting in the late 1990s, the hardware became increasingly programmable, culminating in NVIDIA's first GPU in 1999.
• CPU + GPU is a powerful combination:
– CPUs consist of a few cores optimized for serial processing.
– GPUs consist of thousands of smaller, more efficient cores designed for parallel performance.
– Serial portions of the code run on the CPU, while parallel portions run on the GPU.
Architecture of GPU
Image copied from http://www.pgroup.com/lit/articles/insider/v2n1a5.htm
Image copied from http://people.maths.ox.ac.uk/~gilesm/hpc/NVIDIA/NVIDIA_CUDA_Tutorial_No_NDA_Apr08.pdf
CUDA Programming
• CUDA (Compute Unified Device Architecture) is a parallel programming platform created by NVIDIA for its GPUs.
• By using CUDA, you can write programs that directly access the GPU.
• The CUDA platform is accessible to programmers via CUDA libraries and extensions to programming languages like C, C++, and Fortran.
– C/C++ programmers use “CUDA C/C++”, compiled with the nvcc compiler
– Fortran programmers can use CUDA Fortran, compiled with the PGI CUDA Fortran compiler
• Terminology:
– Host: the CPU and its memory (host memory)
– Device: the GPU and its memory (device memory)
Programming Paradigm
Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
Each parallel function of the application is executed as a kernel
Programming Flow
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute
3. Copy results from GPU memory to CPU memory
• Each parallel function of the application is executed as a kernel
• That means GPUs are programmed as a sequence of kernels; typically, each kernel completes execution before the next kernel begins.
• Fermi has some support for multiple, independent kernels to execute simultaneously, but most kernels are large enough to fill the entire machine.
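The three-step flow above can be sketched as a minimal complete program. This is an illustrative reconstruction, not code from the slides; the kernel name `scale` and the buffer names are assumptions:

```cuda
#include <stdio.h>

// Trivial kernel standing in for the "GPU program" in step 2.
__global__ void scale(float *data, float factor) {
    data[threadIdx.x] *= factor;
}

int main(void) {
    const int n = 8;
    float h[8];                        // host buffer
    for (int i = 0; i < n; i++) h[i] = (float)i;

    float *d;                          // device buffer
    cudaMalloc((void **)&d, n * sizeof(float));

    // 1. Copy input data from CPU memory to GPU memory
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // 2. Load GPU program and execute
    scale<<<1, n>>>(d, 2.0f);

    // 3. Copy results from GPU memory to CPU memory
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("h[3] = %f\n", h[3]);
    cudaFree(d);
    return 0;
}
```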
Hello World! Example
• __global__ is a CUDA C/C++ keyword meaning:
– mykernel() will be executed on the device
– mykernel() will be called from the host
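The slide's code is not preserved in this text. A minimal sketch of the Hello World example it describes (a reconstruction of the standard tutorial program, not the slide's exact code):

```cuda
#include <stdio.h>

// __global__ marks a kernel: executed on the device, called from the host.
__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1, 1>>>();        // launch 1 block of 1 thread on the device
    printf("Hello World!\n");    // printed by the host
    return 0;
}
```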
Addition Example
• Since add() runs on the device, the pointers a, b, and c must point to device memory
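The slide images for this example are missing from the text. A sketch of the single-element addition, assuming the usual host variables a, b, c and device pointers d_a, d_b, d_c (a reconstruction, not the slide's exact code):

```cuda
#include <stdio.h>

// Runs on the device: a, b, c must point to device memory.
__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

int main(void) {
    int a = 2, b = 7, c;              // host copies
    int *d_a, *d_b, *d_c;             // device pointers
    int size = sizeof(int);

    // Allocate device memory
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Copy inputs from host to device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // Launch add() on the GPU
    add<<<1, 1>>>(d_a, d_b, d_c);

    // Copy the result back to the host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);
    printf("%d + %d = %d\n", a, b, c);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```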
Vector Addition Example
Kernel Function:
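The kernel itself is not preserved here. A sketch of the block-parallel version, assuming one block per element (a reconstruction):

```cuda
// One block per element: blockIdx.x selects which element this block handles.
__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
```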
main:
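The main program is also missing from the text. A sketch that allocates N-element arrays on host and device, copies in, launches one block per element, and copies back (reconstructed; the value of N is an assumption):

```cuda
#include <stdio.h>
#include <stdlib.h>
#define N 512

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

int main(void) {
    int size = N * sizeof(int);
    int *a = (int *)malloc(size);     // host arrays
    int *b = (int *)malloc(size);
    int *c = (int *)malloc(size);
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    int *d_a, *d_b, *d_c;             // device arrays
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    add<<<N, 1>>>(d_a, d_b, d_c);     // N blocks, 1 thread each

    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    printf("c[10] = %d\n", c[10]);

    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```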
Alternative 1:
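The code for this alternative is not preserved. A sketch, assuming it uses threads instead of blocks (one thread per element, all in a single block):

```cuda
// Alternative 1: index by thread instead of by block.
__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

// Launched with 1 block of N threads: add<<<1, N>>>(d_a, d_b, d_c);
```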
Alternative 2:
int globalThreadId = threadIdx.x + blockIdx.x * M;           // M is the number of threads in a block
int globalThreadId = threadIdx.x + blockIdx.x * blockDim.x;  // equivalent, using the built-in blockDim
• So the kernel becomes
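The kernel on this slide is missing; a sketch combining blocks and threads with the global thread index defined above (a reconstruction):

```cuda
// Combined indexing: many blocks, many threads per block.
__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}
```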
• The main becomes
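The revised main is also missing. A sketch launching N total threads grouped into blocks (reconstructed; the values of N and THREADS_PER_BLOCK are assumptions):

```cuda
#include <stdio.h>
#include <stdlib.h>
#define N (2048 * 2048)
#define THREADS_PER_BLOCK 512

__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}

int main(void) {
    int size = N * sizeof(int);
    int *a = (int *)malloc(size), *b = (int *)malloc(size), *c = (int *)malloc(size);
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = i; }

    int *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // N threads total, grouped into blocks of THREADS_PER_BLOCK
    add<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    printf("c[100] = %d\n", c[100]);

    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```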
Handling Arbitrary Vector Sizes
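When the vector length n is not a multiple of the block size, the last block has threads past the end of the array. A sketch of the bounds-checked kernel and its launch (a reconstruction, not the slide's exact code):

```cuda
// Pass n and guard against out-of-range threads in the last block.
__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}

// Launch with enough blocks to cover all n elements (round up):
//   add<<<(n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, n);
```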