First CUDA Program

Page 1

First CUDA Program
Amrita School of Biotechnology

Page 2

#include "stdio.h"int main(){ printf("Hello, world\n"); return 0;}

#include <cuda.h>#include <stdio.h>__global__ void kernel (void)

{ }int main (void){

kernel <<< 1, 1 >>> ();printf("Hello World!\n");

return 0;}

First C program First CUDA program

Compilationnvcc -o first first.cu./first

Compilationgcc -o first first.c./first

Page 3: First CUDA Program. #include "stdio.h" int main() { printf("Hello, world\n"); return 0; } #include __global__ void kernel (void) { } int main (void) {

Kernels
• CUDA C extends C by allowing the programmer to define C functions, called kernels, which, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions.
• A kernel is defined using the __global__ declaration specifier.
• The number of CUDA threads that execute a kernel for a given kernel call is specified using a new <<<…>>> execution configuration syntax (see the sketch below).
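For instance, a minimal sketch of the configuration syntax (the kernel name hello and the launch sizes are our own, and device-side printf is assumed to be available, i.e. compute capability 2.0 or later):

#include <stdio.h>

__global__ void hello(void) {
    // Each of the 2 x 4 = 8 threads executes this body once.
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void) {
    hello<<<2, 4>>>();          // 2 blocks of 4 threads each
    cudaDeviceSynchronize();    // wait for the device-side printf output
    return 0;
}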


Page 4

Example Program 1
• "__global__" says the function is to be compiled to run on a "device" (GPU), not the "host" (CPU).
• Angle brackets "<<<" and ">>>" pass params/args to the runtime.

#include <cuda.h>
#include <stdio.h>

__global__ void kernel(void) { }

int main(void) {
    kernel<<<1, 1>>>();
    printf("Hello World!\n");
    return 0;
}

A function executed on the GPU (device) is usually called a "kernel".

Page 5

Example Program 2
• We can pass parameters to a kernel as we would with any C function.
• We need to allocate memory to do anything useful on a device.
• The two values in the <<<…>>> configuration are blocksPerGrid and threadsPerBlock.

__global__ void add(int a, int b, int *c) {
    *c = a + b;
}

int main(void) {
    int c, *dev_c;
    cudaMalloc((void **)&dev_c, sizeof(int));
    add<<<1, 1>>>(2, 7, dev_c);
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("2 + 7 = %d\n", c);
    cudaFree(dev_c);
    return 0;
}


Page 6

CUDA Device Memory Allocation
• cudaMalloc(): cudaError_t cudaMalloc(void **devPtr, size_t size)
  – Allocates an object in device global memory: size bytes of linear memory on the device, returning a pointer to the allocated memory in *devPtr.
  – Requires two parameters:
    • Address of a pointer to the allocated object
    • Size of the allocated object
• cudaFree(): cudaError_t cudaFree(void *devPtr)
  – Frees an object from device global memory.
  – Takes a pointer to the freed object.

[Figure: CUDA memory hierarchy. A grid of blocks; each block has its own shared memory, each thread its own registers, and all blocks access the device's global memory, which the host can also reach.]


Page 7

CUDA Device Memory Allocation (cont.)

• Code example:
  – Allocate a 64 × 64 single-precision float array
  – Attach the allocated storage to Md
  – "d" is often used to indicate a device data structure

int TILE_WIDTH = 64;
float *Md;
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

cudaMalloc((void**)&Md, size);
cudaFree(Md);
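Since cudaMalloc() returns a cudaError_t, the allocation above can also be checked before use; a minimal sketch (the error-handling style is our own addition, not part of the slide):

// cudaGetErrorString() turns the error code into a readable message.
cudaError_t err = cudaMalloc((void**)&Md, size);
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return 1;
}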


Page 8


Memory model

[Figure: per-thread local memory, per-block shared memory, and per-device global memory shared across sequential kernel launches (Kernel 0, Kernel 1, …); each device's memory is separate from host memory, with transfers via cudaMemcpy().]

There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces.

The global, constant, and texture memory spaces are persistent across kernel launches by the same application.

The CUDA programming model assumes that both the host and the device maintain their own separate memory spaces in DRAM, referred to as host memory and device memory, respectively.
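To illustrate the constant memory space, a minimal sketch (the symbol name coef and the kernel scale are hypothetical): the host fills a __constant__ variable with cudaMemcpyToSymbol(), and every thread can then read it:

__constant__ float coef[4];        // read-only for all threads, persists across kernel launches

__global__ void scale(float *v) {
    v[threadIdx.x] *= coef[0];     // all threads read the same constant
}

// On the host, before launching the kernel:
float h_coef[4] = { 2.0f, 1.0f, 0.5f, 0.25f };
cudaMemcpyToSymbol(coef, h_coef, sizeof(h_coef));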

Page 9


• Each thread can:
  – Read/write per-thread registers
  – Read/write per-thread local memory
  – Read/write per-block shared memory
  – Read/write per-grid global memory
  – Read-only per-grid constant memory
(a kernel sketch follows the figure below)

[Figure: the memory hierarchy diagram from Page 6, extended with per-grid constant memory accessible by the host and all threads.]
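A small kernel sketch of these access rules (our own illustration, assuming a single block of 64 threads): each thread reads and writes per-block shared memory, while the inputs and outputs live in per-grid global memory:

__global__ void reverse(const int *d_in, int *d_out) {
    __shared__ int tmp[64];     // per-block shared memory
    int t = threadIdx.x;        // thread-local value, held in a register
    tmp[t] = d_in[t];           // read global memory, write shared memory
    __syncthreads();            // wait until every thread in the block has written
    d_out[t] = tmp[63 - t];     // read shared memory, write global memory
}
// launched as: reverse<<<1, 64>>>(d_in, d_out);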

Page 10

CUDA Host-Device Data Transfer
• cudaMemcpy(): memory data transfer

  cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind)

  – Requires four parameters:
    • Pointer to destination
    • Pointer to source
    • Number of bytes copied
    • Type of transfer:
      – Host to Host: cudaMemcpyHostToHost
      – Host to Device: cudaMemcpyHostToDevice
      – Device to Host: cudaMemcpyDeviceToHost
      – Device to Device: cudaMemcpyDeviceToDevice
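For example, a hedged sketch of the four transfer kinds (the buffer names h_a, h_b, d_a, d_b and the byte count size are hypothetical):

cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);   // host   -> device
cudaMemcpy(d_b, d_a, size, cudaMemcpyDeviceToDevice); // device -> device
cudaMemcpy(h_b, d_b, size, cudaMemcpyDeviceToHost);   // device -> host
cudaMemcpy(h_a, h_b, size, cudaMemcpyHostToHost);     // host   -> host (like memcpy)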

[Figure: the same memory hierarchy diagram as on Page 6.]


Page 11

CUDA Host-Device Data Transfer (cont.)

• Code example:
  – Transfer a 64 × 64 single-precision float array
  – M is in host memory and Md is in device memory
  – cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);


Page 12

Summing Vectors
A simple example to illustrate threads and how we use them to code with CUDA C.


Page 13

Traditional C code on the CPU:

#include <stdio.h>
#define N 10

void add(int *a, int *b, int *c) {
    int tid = 0;      // this is CPU zero, so we start at zero
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += 1;     // we have one CPU, so we increment by one
    }
}

int main(void) {
    int a[N], b[N], c[N];
    // fill the arrays 'a' and 'b' on the CPU
    for (int i = 0; i < N; i++) {
        a[i] = -i;
        b[i] = i * i;
    }
    add(a, b, c);
    // display the results
    for (int i = 0; i < N; i++) {
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    }
    return 0;
}


Page 14

GPU Vector Sums
We can accomplish the same addition very similarly on a GPU by writing add() as a device function.

#include <stdio.h>
#define N 10

int main(void) {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    // allocate the memory on the GPU
    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));

    // fill the arrays 'a' and 'b' on the CPU
    for (int i = 0; i < N; i++) {
        a[i] = -i;
        b[i] = i * i;
    }

    // copy the arrays 'a' and 'b' to the GPU
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<N, 1>>>(dev_a, dev_b, dev_c);

(the listing continues on Page 15)

Page 15

    // copy the array 'c' back from the GPU to the CPU
    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    // display the results
    for (int i = 0; i < N; i++) {
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    }

    // free the memory allocated on the GPU
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    return 0;
}

// Kernel definition
__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x;    // handle the data at this index
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}


__global__ void KernelFunc(...);   // kernel, callable from host
__device__ void DeviceFunc(...);   // function callable on device
__device__ int GlobalVar;          // variable in device memory
__shared__ int SharedVar;          // variable in per-block shared memory
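A short sketch combining these qualifiers (our own illustration, not from the slides): a __global__ kernel that calls a __device__ helper and stages results in a __shared__ array:

__device__ int square(int x) {        // callable only from device code
    return x * x;
}

__global__ void squares(int *out) {   // callable from the host
    __shared__ int cache[32];         // one copy per block
    int t = threadIdx.x;
    cache[t] = square(t);
    __syncthreads();
    out[t] = cache[t];                // out: device buffer of >= 32 ints (hypothetical)
}
// launched as: squares<<<1, 32>>>(dev_out);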

Page 16

• In add<<<N,1>>>(dev_a, dev_b, dev_c);
  – N is the number of blocks that we want to run in parallel.
  – If we call add<<<4,1>>>(...), the function will have four copies running in parallel, where each copy is named a block.
• Thread block = a (data-)parallel task
  – all blocks in a kernel have the same entry point,
  – but may execute any code they want.


Page 17

This is what the actual code being executed in each of the four parallel blocks looks like after the runtime substitutes the appropriate block index for blockIdx.x:
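The original slide presents this as a figure; reconstructed, the four copies for add<<<4,1>>>(...) would look like this:

// Copy run by the block with blockIdx.x == 0:
int tid = 0;
if (tid < N) c[tid] = a[tid] + b[tid];

// Copy run by the block with blockIdx.x == 1:
int tid = 1;
if (tid < N) c[tid] = a[tid] + b[tid];

// Copy run by the block with blockIdx.x == 2:
int tid = 2;
if (tid < N) c[tid] = a[tid] + b[tid];

// Copy run by the block with blockIdx.x == 3:
int tid = 3;
if (tid < N) c[tid] = a[tid] + b[tid];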

The runtime system launches the same kernel once per block, and each block sees a different value of blockIdx.x, so the work is done in parallel.
