J.A.R. J.C.G. T.R.G.B.
GPU: UNDERSTANDING CUDA
TALK STRUCTURE
• What is CUDA?
• History of GPU
• Hardware Presentation
• How does it work?
• Code Example
• Examples & Videos
• Results & Conclusion
WHAT IS CUDA
• Compute Unified Device Architecture
• A parallel computing platform and programming model created by NVIDIA and implemented by the graphics processing units (GPUs) they produce
• CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs
HISTORY
• 1981 – Monochrome Display Adapter
• 1988 – VGA Standard (VGA Controller) – VESA Founded
• 1989 – SVGA
• 1993 – PCI – NVidia Founded
• 1996 – AGP – Voodoo Graphics – Pentium
• 1999 – NVidia GeForce 256 – P3
• 2004 – PCI Express – GeForce6600 – P4
• 2006 – GeForce 8800
• 2008 – GeForce GTX280 / Core2
HISTORICAL PC
[Diagram: the CPU connects through a North Bridge to system memory and, over the system bus and PCI bus, through a South Bridge to the VGA controller (with its screen buffer memory), LAN, and UART]
INTEL PC STRUCTURE
NEW INTEL PC STRUCTURE
VOODOO GRAPHICS SYSTEM ARCHITECTURE
[Diagram: geometry gather and geometry processing run on the CPU; triangle processing, pixel processing, and Z/blend run on the GPU (FBI with frame-buffer memory, TMU with texture memory), connected to the CPU through core logic and system memory]
GEFORCE GTX280 SYSTEM ARCHITECTURE
[Diagram: geometry gather, geometry processing, triangle processing, pixel processing, Z/blend, plus physics and AI and scene management, all run on the GPU with its own GPU memory; the CPU connects through core logic and system memory]
CUDA ARCHITECTURE ROADMAP
SOUL OF NVIDIA’S GPU ROADMAP
• Increase performance per watt
• Make parallel programming easier
• Run more of the application on the GPU
MYTHS ABOUT CUDA
• You have to port your entire application to the GPU
• It is really hard to accelerate your application
• There is a PCI-e bottleneck
CUDA MODELS
• Device Model • Execution Model
DEVICE MODEL
• Scalar processor: the basic execution unit
• Multiprocessor: many scalar processors, plus a register file and shared memory
• Device: a set of multiprocessors
[Diagram: the host feeds an input assembler and thread execution manager; each multiprocessor has a parallel data cache and texture unit, and all of them load/store to global memory]
HARDWARE PRESENTATION
• GeForce GTS450 (specifications)
• GeForce GTX470 (specifications)
• GeForce 8600 GT/GTS (specifications)
EXECUTION MODEL
Vocabulary:
• Host: the CPU.
• Device: the GPU.
• Kernel: a piece of code executed on the GPU (a function or program).
• SIMT: Single Instruction, Multiple Threads.
• Warp: a set of 32 threads; the minimum unit of data processed in SIMT.
EXECUTION MODEL
A CUDA kernel is executed by an array of threads (SIMT). All threads execute the same code; each thread has a unique identifier (threadID, with x, y, z components).
EXECUTION MODEL - SOFTWARE
• Thread: the smallest logical unit
• Block: a set of threads (max 512)
  – Private shared memory
  – Barrier (thread synchronization)
• Grid: a set of blocks
  – Barrier (grid synchronization)
  – No synchronization between blocks
EXECUTION MODEL
Specified by the programmer at runtime:
- Number of blocks (gridDim)
- Block size (blockDim)
CUDA kernel invocation: f<<<G, B>>>(a, b, c)
EXECUTION MODEL - MEMORY ARCHITECTURE
EXECUTION MODEL
• Each thread runs on a scalar processor
• Each thread block runs on a multiprocessor
• A grid runs a single CUDA kernel
SCHEDULE
[Diagram: instruction issue over time on one multiprocessor — warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, …, warp 3 instruction 96. Each block (1…n) is split into warps (1…m).]
• Threads are grouped into blocks
• IDs are assigned to blocks and threads
• Blocks of threads are distributed among the multiprocessors
• Threads of a block are grouped into warps
• A warp is the smallest unit of scheduling and consists of 32 threads
• Several warps reside on each multiprocessor, but only one is running at any time
CODE EXAMPLE
The following program computes and prints the squares of the first 100 integers.

// 1) Include header files
#include <stdio.h>
#include <cuda.h>

// 2) Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] * a[idx];
}

// 3) main() routine, run by the CPU
int main(void) {
    // 3.1: Define pointers to the host and device arrays
    float *a_h, *a_d;
    // 3.2: Define other variables used in the program
    const int N = 100;
    size_t size = N * sizeof(float);
    // 3.3: Allocate the array on the host
    a_h = (float *)malloc(size);
    // 3.4: Allocate the array on the device (DRAM of the GPU)
    cudaMalloc((void **)&a_d, size);
    // Initialize the host array
    for (int i = 0; i < N; i++)
        a_h[i] = (float)i;
    // 3.5: Copy the data from the host array to the device array
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    // 3.6: Kernel call, execution configuration
    //      (one extra block when N is not a multiple of block_size)
    int block_size = 4;
    int n_blocks = N / block_size + (N % block_size != 0);
    square_array<<<n_blocks, block_size>>>(a_d, N);
    // 3.7: Retrieve the result from the device into host memory
    cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);
    // 3.8: Print the result
    for (int i = 0; i < N; i++)
        printf("%d\t%f\n", i, a_h[i]);
    // 3.9: Free the allocated memory on the device and host
    free(a_h);
    cudaFree(a_d);
    return 0;
}
CUDA LIBRARIES
TESTING
EXAMPLES
• Video example with an NVIDIA Tesla
• Development environment
RADIX SORT RESULTS
[Chart: radix-sort running time for 1,000,000 / 10,000,000 / 51,000,000 / 100,000,000 elements on the GTS 450, GTX 470, GeForce 8600, and GTX 560M; vertical axis from 0 to 1.6]
CONCLUSION
• Easy to use and powerful, so it is worth it!
• GPU computing is the future: the results confirm our theory, and the industry is giving it more and more importance.
• In the coming years we will see more applications that use parallel computing.
DOCUMENTATION & LINKS
• http://www.nvidia.es/object/cuda_home_new_es.html
• http://www.nvidia.com/docs/IO/113297/ISC-Briefing-Sumit-June11-Final.pdf
• http://cs.nyu.edu/courses/spring12/CSCI-GA.3033-012/lecture5.pdf
• http://www.hpca.ual.es/~jmartine/CUDA/SESION3_CUDA_GPU_EMG_JAM.pdf
• http://www.geforce.com/hardware/technology/cuda/supported-gpus
• http://en.wikipedia.org/wiki/GeForce_256
• http://en.wikipedia.org/wiki/CUDA
• https://developer.nvidia.com/technologies/Libraries
• https://www.udacity.com/wiki/cs344/troubleshoot_gcc47
• http://stackoverflow.com/questions/12986701/installing-cuda-5-samples-in-ubuntu-12-10
QUESTIONS?