J.A.R. J.C.G. T.R.G.B.
GPU: UNDERSTANDING CUDA
TALK STRUCTURE
• What is CUDA?
• History of GPU
• Hardware Presentation
• How does it work?
• Code Example
• Examples & Videos
• Results & Conclusion
WHAT IS CUDA
• Compute Unified Device Architecture
• A parallel computing platform and programming model created by NVIDIA and implemented by the graphics processing units (GPUs) they produce
• CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs
HISTORY
• 1981 – Monochrome Display Adapter
• 1988 – VGA Standard (VGA Controller) – VESA Founded
• 1989 – SVGA
• 1993 – PCI – NVidia Founded
• 1996 – AGP – Voodoo Graphics – Pentium
• 1999 – NVidia GeForce 256 – P3
• 2004 – PCI Express – GeForce6600 – P4
• 2006 – GeForce 8800
• 2008 – GeForce GTX280 / Core2
HISTORICAL PC
[Diagram: the CPU connects through a North Bridge to system memory and, over the system bus and PCI bus, through a South Bridge to the VGA controller (with its screen buffer memory), LAN, and UART]
INTEL PC STRUCTURE
NEW INTEL PC STRUCTURE
VOODOO GRAPHICS SYSTEM ARCHITECTURE
[Diagram: geometry gather and geometry processing run on the CPU; triangle processing, pixel processing, and Z/blend run on the GPU (FBI with frame-buffer memory, TMU with texture memory), connected to the CPU through core logic and system memory]
GEFORCE GTX280 SYSTEM ARCHITECTURE
[Diagram: geometry gather, geometry processing, triangle processing, pixel processing, Z/blend, plus physics and AI and scene management, all run on the GPU with its own GPU memory; the CPU connects through core logic and system memory]
CUDA ARCHITECTURE ROADMAP
SOUL OF NVIDIA’S GPU ROADMAP
• Increase performance per watt
• Make parallel programming easier
• Run more of the application on the GPU
MYTHS ABOUT CUDA
• You have to port your entire application to the GPU
• It is really hard to accelerate your application
• There is a PCI-e bottleneck
CUDA MODELS
• Device Model • Execution Model
DEVICE MODEL
• Scalar processor: the basic execution unit
• Multiprocessor: many scalar processors, plus a register file and shared memory
• Device: a set of multiprocessors
[Diagram: the host feeds an input assembler and thread execution manager; each multiprocessor has a parallel data cache and texture unit, and all of them load/store to global memory]
HARDWARE PRESENTATION
• GeForce GTS450 (specifications)
• GeForce GTX470 (specifications)
• GeForce 8600 GT/GTS (specifications)
EXECUTION MODEL
Vocabulary:
• Host: the CPU.
• Device: the GPU.
• Kernel: a piece of code executed on the GPU (a function or program).
• SIMT: Single Instruction, Multiple Threads.
• Warp: a set of 32 threads; the minimum unit of data processed in SIMT.
EXECUTION MODEL
A CUDA kernel is executed by an array of threads (SIMT). All threads execute the same code; each thread has a unique identifier (threadID, with x, y, z components).
EXECUTION MODEL - SOFTWARE
• Thread: the smallest logical unit
• Block: a set of threads (max 512)
  – Private shared memory
  – Barrier (thread synchronization)
• Grid: a set of blocks
  – Barrier (grid synchronization)
  – No synchronization between blocks
EXECUTION MODEL
Specified by the programmer at runtime:
- Number of blocks (gridDim)
- Block size (blockDim)
CUDA kernel invocation: f<<<G, B>>>(a, b, c)
EXECUTION MODEL - MEMORY ARCHITECTURE
EXECUTION MODEL
• Each thread runs on a scalar processor
• Each thread block runs on a multiprocessor
• A grid runs a single CUDA kernel
SCHEDULE
[Diagram: instruction issue over time on one multiprocessor — warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, …, warp 3 instruction 96. Each block (1…n) is split into warps (1…m).]
• Threads are grouped into blocks
• IDs are assigned to blocks and threads
• Blocks of threads are distributed among the multiprocessors
• Threads of a block are grouped into warps
• A warp is the smallest unit of scheduling and consists of 32 threads
• Several warps reside on each multiprocessor, but only one is running at any time
CODE EXAMPLE
The following program computes and prints the squares of the first 100 integers.

// 1) Include header files
#include <stdio.h>
#include <cuda.h>

// 2) Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] * a[idx];
}

// 3) main() routine, run by the CPU
int main(void) {
    // 3.1: Define pointers to the host and device arrays
    float *a_h, *a_d;
    // 3.2: Define other variables used in the program
    const int N = 100;
    size_t size = N * sizeof(float);
    // 3.3: Allocate the array on the host
    a_h = (float *)malloc(size);
    // 3.4: Allocate the array on the device (DRAM of the GPU)
    cudaMalloc((void **)&a_d, size);
    // Initialize the host array
    for (int i = 0; i < N; i++)
        a_h[i] = (float)i;
    // 3.5: Copy the data from the host array to the device array
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    // 3.6: Kernel call, execution configuration
    //      (one extra block when N is not a multiple of block_size)
    int block_size = 4;
    int n_blocks = N / block_size + (N % block_size != 0);
    square_array<<<n_blocks, block_size>>>(a_d, N);
    // 3.7: Retrieve the result from the device into host memory
    cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);
    // 3.8: Print the result
    for (int i = 0; i < N; i++)
        printf("%d\t%f\n", i, a_h[i]);
    // 3.9: Free the allocated memory on the device and host
    free(a_h);
    cudaFree(a_d);
    return 0;
}
CUDA LIBRARIES
TESTING
EXAMPLES
• Video example with an NVIDIA Tesla
• Development environment
RADIX SORT RESULTS
[Chart: radix-sort running time for 1,000,000 / 10,000,000 / 51,000,000 / 100,000,000 elements on the GTS 450, GTX 470, GeForce 8600, and GTX 560M; vertical axis from 0 to 1.6]
CONCLUSION
• Easy to use and powerful, so it is worth it!
• GPU computing is the future: the results confirm our theory, and the industry is giving it more and more importance.
• In the coming years we will see more applications that use parallel computing.
DOCUMENTATION & LINKS
• http://www.nvidia.es/object/cuda_home_new_es.html
• http://www.nvidia.com/docs/IO/113297/ISC-Briefing-Sumit-June11-Final.pdf
• http://cs.nyu.edu/courses/spring12/CSCI-GA.3033-012/lecture5.pdf
• http://www.hpca.ual.es/~jmartine/CUDA/SESION3_CUDA_GPU_EMG_JAM.pdf
• http://www.geforce.com/hardware/technology/cuda/supported-gpus
• http://en.wikipedia.org/wiki/GeForce_256
• http://en.wikipedia.org/wiki/CUDA
• https://developer.nvidia.com/technologies/Libraries
• https://www.udacity.com/wiki/cs344/troubleshoot_gcc47
• http://stackoverflow.com/questions/12986701/installing-cuda-5-samples-in-ubuntu-12-10
QUESTIONS?