GPU Recipe

7/30/2019 Gpu Recipe

Recipe for running simple CUDA code on a GPU-based Rocks cluster

GPU workshop
UR, June 20, 2008
Alice Quillen

Head node: sneth.pas.rochester.edu
http://astro.pas.rochester.edu/~aquillen/gpuworkshop.html


Outline

The kernel: .cu CUDA files
Calling __global__ functions
Setting grids and threads
Allocating memory on the device
Compiling in emulation mode
Submitting a job to the queue using SGE
Timing your job for benchmarks


CUDA C, .cu files

Routines that call the device must be in plain C, with extension .cu.

Often there are two files:
1) The kernel: a .cu file containing the routines that run on the device.
2) A .cu file that calls the routines from the kernel; it #includes the kernel.

Optional additional .cpp or .c files with other routines can be linked in.


Kernel, Threads + Blocks

The kernel has two different types of functions:
__global__  called from host, executed on device
__device__  called from device, run on device

__global__ void addone(float* d_array) // kernel is addone
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    d_array[index] += 1.0;
}

d_array must be allocated on the device. The code above loops over all threads and blocks automatically; on a CPU this would be inside a loop.


Calling a __global__ routine: threads and blocks

Calling the routine from the host to run on the device:

int p = 256;                    // power of 2
dim3 threads(p, 1, 1);          // number of threads per block is p
dim3 grid(arraylength/p, 1, 1); // number of blocks is arraylength divided by the
                                // number of threads; arraylength is assumed a multiple of p
int sharedMemsize = 0;
addone<<< grid, threads, sharedMemsize >>>(d_array); // call the kernel

Inside the routine, blockDim.x is p (the number of threads), threadIdx.x ranges from 0 to p-1 (covers all threads), and blockIdx.x ranges from 0 to arraylength/p - 1 (covers all blocks).

__global__ void addone(float* d_array)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    d_array[index] += 1.0;
}


Memory allocation on device

int arraylength = 10240; // note: a multiple of 256; pad with zeros if your vector
                         // isn't a multiple
unsigned int memSize = arraylength * sizeof(float);

// allocate array on host
float* h_array = (float*) malloc(memSize); // array on host

float* d_array; // array on device
// allocate array on device
CUDA_SAFE_CALL(cudaMalloc((void**)&d_array, memSize));
// CUDA_SAFE_CALL is a CUDA Utility Tool macro. If an error is found, it prints an error
// message, the file name, and the line number in the file where the error occurred.

// copy array from host to device
CUDA_SAFE_CALL(cudaMemcpy(d_array, h_array, memSize, cudaMemcpyHostToDevice));

// copy array from device to host
CUDA_SAFE_CALL(cudaMemcpy(h_array, d_array, memSize, cudaMemcpyDeviceToHost));


Global Device Memory

Keep track of what data is on the device and what is on the host.

If you address host memory from the device, the code doesn't work.


A code that adds 1 to every element of a large array on the GPU

#include <cutil.h>          // CUDA-defined stuff (CUT_DEVICE_INIT, CUDA_SAFE_CALL)
#include <stdio.h>          // usual io
#include "addone_kernel.cu" // contains kernel

int main(int argc, char** argv)
{
    CUT_DEVICE_INIT(); // device init; also tells you if multiple devices are present

    // Allocate host array, fill it with something
    // Allocate device array
    // Copy array from host to device
    // Launch kernel
    // Copy array back from device to host
    // Print some of it out so you can see that the device did something

    CUDA_SAFE_CALL(cudaFree(d_array)); // free memory
    CUT_EXIT(argc, argv); // use the CUDA Utility Tool to exit cleanly
}


Compiling

Sample head of Makefile:

# name of executable
EXECUTABLE := addone
# Cuda source files (compiled with cudacc); don't put the kernel
# in, as it is #included in the other file
CUFILES := addone.cu
CU_DEPS := addone_kernel.cu
# C/C++ source files (compiled with gcc / c++)
CCFILES :=
C_DEPS :=

ROOTDIR = /usr/local/NVIDIA_CUDA_SDK/common
include $(ROOTDIR)/common.mk # SDK make rules


Emulator

[aquillen@sneth ~]$ make emu=1

Makes executable code that runs on the CPU, not on the device. The executable can be run on the head node (which lacks GPUs); it is not necessary to send a job to the queue.

In addone_kernel.cu:

__global__ void addone(float* d_array)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    d_array[index] += 1.0;
    printf("%.2e\n", d_array[index]); // emulator only!
}

This will not compile or run on the device, but will compile and run in the emulator. It prints for every block and thread. Much more manageable:

if (threadIdx.x == 0) printf("%.2e\n", d_array[index]);


Debugging

Code that works in emulator mode may not work on the device (particularly memory access and synchronization problems). However, debugging in emulator mode is a pretty good way to get closer to working code.

A profiler is recently available, but we have not yet had a chance to play with it.


Queue system

SGE (Sun Grid Engine) is an open-source batch-queuing system supported by Sun; it comes as a roll with Rocks.

Rocks is an open-source Linux cluster distribution that enables end users to easily build computational clusters. CUDA comes as a roll.

To submit a job:

[aquillen@sneth ~]$ qsub sge.bash

where sge.bash is a text file that looks like the following.


Example sge.bash file

#!/bin/bash            # remind me that this uses the bash shell
#$ -cwd                # use current directory
#$ -o std.out          # stdout goes into this file
#$ -e std.err          # stderr goes into this file
#$ -S /bin/bash        # specify bash shell
#$ -m be               # email at begin and end of job
#$ -M [email protected]  # where to send email
#$ -N gal10            # name of job
#$ -l h_rt=24:00:00    # let it run no more than 24 hours
#$ -V                  # export environment variables to the job
date +%s               # time in seconds for a simple benchmark
# execute some code      (a comment)
/home/aquillen/galint/exe/release/mkgal.out   # executables
/home/aquillen/galint/exe/release/galint
date                   # print date
date +%s
echo complete
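The two date +%s lines bracket the run; subtracting one from the other gives wall-clock seconds. A minimal sketch of the same idea (the sleep stands in for the real executables):

```shell
t0=$(date +%s)   # time in seconds at start
sleep 2          # stand-in for the real job
t1=$(date +%s)   # time in seconds at end
echo "elapsed: $((t1 - t0)) s"
```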


Queue commands

qstat          look at what jobs are running
qdel 701       kill that job
qhost          check all the nodes
qstat -j 703   explain to me what is going on with that job


Workshop
http://astro.pas.rochester.edu/~aquillen/gpuworkshop/assigna.html

Assignment #1: Submitting a job to the queue using SGE (qsub).
Here is a sample sge (Grid Engine) file written in bash shell: sge.bash. Transfer this file to your home directory on our Rocks cluster sneth.pas.rochester.edu. Modify the file sge.bash so that it prints "hello, World" into a file called std.out and sleeps for 60 seconds or so. Submit it as a job to the queue (qsub sge.bash). Check the queue status to make sure your job is run (qstat).

Assignment #2: Compiling and running the routine addone that adds 1 to every element in a large array.
Here is a sample kernel: addone_kernel.cu. Here is a routine that calls this kernel: addone.cu. A Makefile to compile it: Makefile. An sge bash file to send it to the queue: cudasge.bash. Transfer the above 4 files to your home directory on sneth. They can also be copied directly from the directory ~aquillen/addone/ on sneth. Edit the cudasge.bash file so it is appropriate for you. Compile the code with make. The executable will be in a subdirectory called release. Run it by submitting the cudasge.bash file to the queue. Compile the code in emulation mode (make emu=1). The executable will be in a subdirectory called emurelease. Run this code. It will run on the head node, so you don't need to submit it to the queue.

Assignment #3: Write your own routine by modifying the above addone routines.
Modify the kernel so that it computes the derivative of the array using a finite difference (f[i+1]-f[i]). You will need input and output arrays as arguments to your __global__ routine in the kernel, as the order in which threads and blocks execute is not specified. You do not want to address memory outside the array, as this will crash the code. If you check the addressing with an if statement, your routine will take twice as long to run. Try to write the routine without an if statement. Hint: you can trade an irrelevant amount of device global memory for the if statement.

Assignment #4: Discuss how you could implement a box smooth or 1D convolution routine.


Benchmarking

There are many ways to time routines. Here is the one used in the SDK. I think it uses gettimeofday() (from sys/time.h) on Linux.

// initialize a timer
unsigned int timer = 0;
CUT_SAFE_CALL(cutCreateTimer(&timer));
// start it
CUT_SAFE_CALL(cutStartTimer(timer));

// run your code here

// stop the timer
CUT_SAFE_CALL(cutStopTimer(timer));
printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));
// time in milliseconds, returned as a float
// delete the timer
CUT_SAFE_CALL(cutDeleteTimer(timer));


Links

http://astro.pas.rochester.edu/~aquillen/gpuworkshop.html
Includes links to the CUDA manual, the SDK (software development kit, with lots of good programming examples), and other online tutorials.