Page 1: GPU: Understanding CUDA

J.A.R. J.C.G. T.R.G.B.

GPU: UNDERSTANDING CUDA

Page 2: GPU: Understanding CUDA

TALK STRUCTURE

• What is CUDA?
• History of the GPU
• Hardware Presentation
• How does it work?
• Code Example
• Examples & Videos
• Results & Conclusion

Page 3: GPU: Understanding CUDA

WHAT IS CUDA

• Compute Unified Device Architecture
• A parallel computing platform and programming model created by NVIDIA and implemented by the graphics processing units (GPUs) that they produce
• CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs

Page 4: GPU: Understanding CUDA

HISTORY • 1981 – Monochrome Display Adapter

• 1988 – VGA Standard (VGA Controller) – VESA Founded

• 1989 – SVGA

• 1993 – PCI – NVidia Founded

• 1996 – AGP – Voodoo Graphics – Pentium

• 1999 – NVidia GeForce 256 – P3

• 2004 – PCI Express – GeForce6600 – P4

• 2006 – GeForce 8800

• 2008 – GeForce GTX280 / Core2

Page 5: GPU: Understanding CUDA

HISTORICAL PC

[Diagram: classic PC architecture — the CPU reaches the North Bridge (system memory) over the System Bus; the South Bridge bridges to the PCI Bus, where the VGA controller with its screen memory buffer sits alongside LAN and UART devices]

Page 6: GPU: Understanding CUDA

INTEL PC STRUCTURE

Page 7: GPU: Understanding CUDA

NEW INTEL PC STRUCTURE

Page 8: GPU: Understanding CUDA

VOODOO GRAPHICS SYSTEM ARCHITECTURE

[Diagram: Voodoo Graphics pipeline — geometry gather and geometry processing run on the CPU (with core logic and system memory); the board (FBI + TMU) handles triangle processing, pixel processing, and Z/blend, with its own frame-buffer and texture memory]

Page 9: GPU: Understanding CUDA

GEFORCE GTX280 SYSTEM ARCHITECTURE

[Diagram: GeForce GTX280 system architecture — geometry gather, geometry processing, triangle processing, pixel processing, and Z/blend all run on the GPU with dedicated GPU memory; physics and AI also appear on the GPU side, while scene management stays with the CPU, core logic, and system memory]

Page 10: GPU: Understanding CUDA

CUDA ARCHITECTURE ROADMAP

Page 11: GPU: Understanding CUDA

SOUL OF NVIDIA’S GPU ROADMAP

• Increase Performance / Watt
• Make Parallel Programming Easier
• Run more of the Application on the GPU

Page 12: GPU: Understanding CUDA

MYTHS ABOUT CUDA

• You have to port your entire application to the GPU
• It is really hard to accelerate your application
• There is a PCI-e bottleneck

Page 13: GPU: Understanding CUDA

CUDA MODELS

• Device Model
• Execution Model

Page 14: GPU: Understanding CUDA

DEVICE MODEL

Scalar Processor. A multiprocessor combines many scalar processors with a register file and shared memory.

Page 15: GPU: Understanding CUDA

DEVICE MODEL

Multiprocessor. A device groups many multiprocessors.

Page 16: GPU: Understanding CUDA

DEVICE MODEL

[Diagram: full device — the Host feeds an Input Assembler and a Thread Execution Manager, which dispatch work to an array of multiprocessors; each multiprocessor has its own Parallel Data Cache and texture unit, and all reach Global Memory through load/store paths]
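These device-model quantities can be queried from the CUDA runtime. A minimal sketch, not part of the original slides, assuming a CUDA toolkit install and device 0:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    // Query the properties of device 0
    cudaGetDeviceProperties(&prop, 0);
    printf("Name: %s\n", prop.name);
    printf("Multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Registers per block: %d\n", prop.regsPerBlock);
    printf("Warp size: %d\n", prop.warpSize);
    return 0;
}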

Page 17: GPU: Understanding CUDA

HARDWARE PRESENTATION Geforce GTS450

Page 18: GPU: Understanding CUDA

HARDWARE PRESENTATION Geforce GTS450

Page 19: GPU: Understanding CUDA

HARDWARE PRESENTATION Geforce GTS450 Specifications

Page 20: GPU: Understanding CUDA

HARDWARE PRESENTATION Geforce GTX470

Page 21: GPU: Understanding CUDA

HARDWARE PRESENTATION Geforce GTX470 Specifications

Page 22: GPU: Understanding CUDA

HARDWARE PRESENTATION

Page 23: GPU: Understanding CUDA

HARDWARE PRESENTATION Geforce 8600 GT/GTS Specifications

Page 24: GPU: Understanding CUDA

EXECUTION MODEL

Vocabulary:
• Host: the CPU.
• Device: the GPU.
• Kernel: a piece of code executed on the GPU (a function or program).
• SIMT: Single Instruction, Multiple Threads.
• Warp: a set of 32 threads; the minimum unit scheduled in SIMT.

Page 25: GPU: Understanding CUDA

EXECUTION MODEL

A CUDA kernel is executed by an array of threads. All threads execute the same code; each thread has a unique identifier (threadID (x, y, z)).

SIMT
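As an illustration of SIMT, a minimal kernel sketch (a hypothetical example, not from the slides): every thread runs the same code but computes a unique global index from its block and thread IDs.

__global__ void scale(float *a, float s, int n)
{
    // Same code in every thread; the index makes each one unique
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)          // guard threads that fall past the end of the array
        a[idx] *= s;
}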

Page 26: GPU: Understanding CUDA

EXECUTION MODEL - SOFTWARE

Grid: a set of blocks
• No synchronization between blocks

Block: a set of threads (max 512)
• Private shared memory
• Barrier (thread synchronization)

Thread: smallest logical unit
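A minimal sketch of the two block-level features above (a hypothetical example, assuming the array length is a multiple of the block size): each block stages data in its private shared memory and waits at the barrier before reading it back.

#define BLOCK 256

__global__ void reverse_each_block(float *d)
{
    __shared__ float s[BLOCK];           // shared memory private to this block
    int t = threadIdx.x;
    s[t] = d[blockIdx.x * BLOCK + t];    // each thread stages one element
    __syncthreads();                     // barrier: the whole block must arrive
    d[blockIdx.x * BLOCK + t] = s[BLOCK - 1 - t];  // reverse within the block
}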

Page 27: GPU: Understanding CUDA

EXECUTION MODEL

Specified by the programmer at runtime:
- Number of blocks (gridDim)
- Block size (blockDim)

CUDA kernel invocation: f<<<G, B>>>(a, b, c)
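A sketch of such an invocation (f, a, b, and c as on the slide; the sizes are assumptions): the grid size is rounded up so every element is covered.

int N = 1000;                        // assumed problem size
dim3 B(256);                         // block size: 256 threads
dim3 G((N + B.x - 1) / B.x);         // number of blocks, rounded up
f<<<G, B>>>(a, b, c);                // launch kernel f on the device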

Page 28: GPU: Understanding CUDA

EXECUTION MODEL - MEMORY ARCHITECTURE

Page 29: GPU: Understanding CUDA

EXECUTION MODEL

Each thread runs on a scalar processor.

Thread blocks run on a multiprocessor.

A grid runs a single CUDA kernel.

Page 30: GPU: Understanding CUDA

SCHEDULE

[Diagram: time-sliced warp scheduling — over time, one multiprocessor issues warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, ..., warp 3 instruction 96; each of blocks 1..n contributes warps 1..m]

• Threads are grouped into blocks
• IDs are assigned to blocks and threads
• Blocks of threads are distributed among the multiprocessors
• Threads of a block are grouped into warps
• A warp is the smallest unit of scheduling and consists of 32 threads
• Several warps reside on each multiprocessor, but only one is running at a time
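How many blocks (and hence warps) can be resident on one multiprocessor can be asked from the runtime. A sketch using the occupancy API (an assumption on my part: this API arrived in CUDA 6.5, later than the toolkits this deck likely used; square_array is the kernel from the code example below):

int blocksPerSM = 0;
int blockSize = 256;
// How many blocks of square_array can be resident on one
// multiprocessor at once, with 0 bytes of dynamic shared memory
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &blocksPerSM, square_array, blockSize, 0);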

Page 31: GPU: Understanding CUDA

CODE EXAMPLE The following program calculates and prints the squares of the first 100 integers.

// 1) Include header files
#include <stdio.h>
#include <cuda.h>

// 2) Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
    // Each thread squares one element of the array
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] * a[idx];
}

// 3) main() routine, executed on the CPU (the host)
int main(void)
{

Page 32: GPU: Understanding CUDA

CODE EXAMPLE
// 3.1: Define pointers to the host and device arrays
float *a_h, *a_d;
// 3.2: Define other variables used in the program, e.g. the array size
const int N = 100;
size_t size = N * sizeof(float);
// 3.3: Allocate the array on the host
a_h = (float *)malloc(size);
// 3.4: Allocate the array on the device (DRAM of the GPU)
cudaMalloc((void **)&a_d, size);
// Initialize the host array
for (int i = 0; i < N; i++)
    a_h[i] = (float)i;

Page 33: GPU: Understanding CUDA

CODE EXAMPLE
// 3.5: Copy the data from the host array to the device array
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

// 3.6: Kernel call, execution configuration
// Round the number of blocks up so all N elements are covered
int block_size = 4;
int n_blocks = N / block_size + (N % block_size == 0 ? 0 : 1);
square_array<<<n_blocks, block_size>>>(a_d, N);

// 3.7: Retrieve the result from the device into host memory
cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);

Page 34: GPU: Understanding CUDA

CODE EXAMPLE
// 3.8: Print the result
for (int i = 0; i < N; i++)
    printf("%d\t%f\n", i, a_h[i]);

// 3.9: Free the allocated memory on the host and the device
free(a_h);
cudaFree(a_d);
return 0;
}
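The example above ignores return codes. A minimal sketch (not in the original slides) of the usual pattern, checking both the launch and the execution of the kernel:

square_array<<<n_blocks, block_size>>>(a_d, N);
cudaError_t err = cudaGetLastError();      // errors from the launch itself
if (err != cudaSuccess)
    printf("Launch failed: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();             // errors from kernel execution
if (err != cudaSuccess)
    printf("Kernel failed: %s\n", cudaGetErrorString(err));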

Page 35: GPU: Understanding CUDA

CUDA LIBRARIES
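The original slide is an image. As one concrete example of a CUDA library, a minimal cuBLAS sketch (an assumption, not from the deck; values are made up) computing y = alpha*x + y on the device:

#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    const int N = 4;
    float h_x[N] = {1, 1, 1, 1};
    float h_y[N] = {0, 1, 2, 3};
    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, N * sizeof(float));
    cudaMalloc((void **)&d_y, N * sizeof(float));
    cudaMemcpy(d_x, h_x, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, N * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 2.0f;
    cublasSaxpy(handle, N, &alpha, d_x, 1, d_y, 1);  // y = alpha*x + y
    cublasDestroy(handle);

    cudaMemcpy(h_y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++)
        printf("%f\n", h_y[i]);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}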

Page 36: GPU: Understanding CUDA

TESTING

Page 37: GPU: Understanding CUDA

TESTING

Page 38: GPU: Understanding CUDA

TESTING

Page 39: GPU: Understanding CUDA

EXAMPLES

• Video example with an NVidia Tesla
• Development environment

Page 40: GPU: Understanding CUDA

RADIX SORT RESULTS

[Chart: radix sort running time (y-axis from 0 to 1.6) versus input size — 1,000,000; 10,000,000; 51,000,000; 100,000,000 elements — for the GTS 450, GTX 470, GeForce 8600, and GTX 560M]

Page 41: GPU: Understanding CUDA

CONCLUSION

• Easy to use and powerful, so it is worth it!
• GPU computing is the future. Our results confirm this, and the industry is giving it more and more importance.
• In the coming years we will see more applications that use parallel computing.

Page 42: GPU: Understanding CUDA

DOCUMENTATION & LINKS

• http://www.nvidia.es/object/cuda_home_new_es.html

• http://www.nvidia.com/docs/IO/113297/ISC-Briefing-Sumit-June11-Final.pdf

• http://cs.nyu.edu/courses/spring12/CSCI-GA.3033-012/lecture5.pdf

• http://www.hpca.ual.es/~jmartine/CUDA/SESION3_CUDA_GPU_EMG_JAM.pdf

• http://www.geforce.com/hardware/technology/cuda/supported-gpus

• http://en.wikipedia.org/wiki/GeForce_256

• http://en.wikipedia.org/wiki/CUDA

• https://developer.nvidia.com/technologies/Libraries

• https://www.udacity.com/wiki/cs344/troubleshoot_gcc47

• http://stackoverflow.com/questions/12986701/installing-cuda-5-samples-in-ubuntu-12-10

Page 43: GPU: Understanding CUDA

QUESTIONS?