Programming with CUDA, WS 08/09
Lecture 8, Thu, 18 Nov, 2008
Previously
CUDA Runtime Component
– Common Component
    Data types, math functions, timing, textures
– Device Component
    Math functions, warp voting, atomic functions, synch function, texturing
– Host Component
    High-level runtime API
    Low-level driver API
Previously
CUDA Runtime Component
– Host Component APIs
    Mutually exclusive
    Runtime API is easier to program, hides some details from programmer
    Driver API gives low level control, harder to program
    Provide: device initialization, management of device, streams and events
Today
CUDA Runtime Component
– Host Component APIs
    Provide: management of memory & textures, OpenGL/Direct3D interoperability (NOT covered)
    Runtime API provides: emulation mode for debugging
    Driver API provides: management of contexts & modules, execution control
Final Projects
Host Runtime Component
Memory Management: Linear Memory
– CUDA Runtime API
    Declare: TYPE*
    Allocate: cudaMalloc, cudaMallocPitch
    Copy: cudaMemcpy, cudaMemcpy2D
    Free: cudaFree
– CUDA Driver API
    Declare: CUdeviceptr
    Allocate: cuMemAlloc, cuMemAllocPitch
    Copy: cuMemcpy, cuMemcpy2D
    Free: cuMemFree
Memory Management: Linear Memory
– Pitch (stride), expected but WRONG:
    // host code
    float *array2D;
    cudaMallocPitch((void**)&array2D, width * sizeof(float), height);  // pitch argument missing

    // device code
    int size = width * sizeof(float);
    for (int r = 0; r < height; ++r) {
        float *row = (float*)((char*)array2D + r * size);  // assumes tightly packed rows
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
Memory Management: Linear Memory
– Pitch (stride), CORRECT:
    // host code
    float *array2D; size_t pitch;
    cudaMallocPitch((void**)&array2D, &pitch, width * sizeof(float), height);

    // device code (array2D and pitch are passed to the kernel)
    for (int r = 0; r < height; ++r) {
        float *row = (float*)((char*)array2D + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
Memory Management: Linear Memory
– Pitch (stride), why?
    Allocation using pitch functions appropriately pads memory for efficient transfer and copy
    Width of allocated rows may exceed width*sizeof(float)
    True row width in bytes is given by pitch
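The row addressing above can be tried on the host without a GPU. The following is a minimal sketch in plain C, not the CUDA API: `alloc_pitched`, `padded_pitch`, `element_at` and the 256-byte `ROW_ALIGN` are hypothetical names and values chosen for illustration; the real padding is decided by the CUDA runtime.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical alignment; the real value is chosen by the CUDA runtime. */
#define ROW_ALIGN 256

/* Round the requested row width (in bytes) up to a multiple of ROW_ALIGN. */
size_t padded_pitch(size_t row_bytes) {
    return (row_bytes + ROW_ALIGN - 1) / ROW_ALIGN * ROW_ALIGN;
}

/* Host-side stand-in for a pitched allocation: returns a zeroed buffer
   and writes the row stride in bytes to *pitch. */
void *alloc_pitched(size_t *pitch, size_t row_bytes, size_t height) {
    assert(pitch != NULL && row_bytes > 0 && height > 0);
    *pitch = padded_pitch(row_bytes);
    return calloc(height, *pitch);
}

/* Address of element (r, c): step r rows in bytes, then index as float. */
float *element_at(void *base, size_t pitch, size_t r, size_t c) {
    float *row = (float *)((char *)base + r * pitch);
    return &row[c];
}
```

With these assumptions, a row of 100 floats (400 bytes) gets a stride of 512 bytes, so stepping by `width * sizeof(float)` instead of `pitch` would land in the wrong row from row 1 onward.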
Memory Management: CUDA Arrays
– CUDA Runtime API
    Declare: cudaArray*
    Channel: cudaChannelFormatDesc, cudaCreateChannelDesc<TYPE>
    Allocate: cudaMallocArray
    Copy (from linear): cudaMemcpy2DToArray
    Free: cudaFreeArray
Memory Management: CUDA Arrays
– CUDA Driver API
    Declare: CUarray
    Channel: CUDA_ARRAY_DESCRIPTOR object
    Allocate: cuArrayCreate
    Copy (from linear): CUDA_MEMCPY2D object
    Free: cuArrayDestroy
Memory Management: various other functions to copy from
– Linear memory to CUDA arrays
– Host to constant memory
– See Reference Manual
Texture Management
– Run-time API: texture type derived from
    struct textureReference {
        int normalized;
        enum cudaTextureFilterMode filterMode;
        enum cudaTextureAddressMode addressMode[3];
        struct cudaChannelFormatDesc channelDesc;
    };
– normalized: 0 means false, any other value means true
Texture Management
– filterMode:
    cudaFilterModePoint: no filtering, returned value is of nearest texel
    cudaFilterModeLinear: filters 2/4/8 neighbors for 1D/2D/3D texture, floats only
– addressMode: (x,y,z)
    cudaAddressModeClamp
    cudaAddressModeWrap: normalized coordinates only
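The filter and address modes can be illustrated with a small host-side model of a 1D fetch. This is a sketch in plain C under simplifying assumptions, not the hardware algorithm: the enum and function names are hypothetical, texel centers are assumed to sit at half-integer positions, and interpolation weights are full-precision floats (real hardware may use reduced-precision weights, so results can differ slightly).

```c
#include <assert.h>

enum AddressMode { ADDR_CLAMP, ADDR_WRAP };      /* hypothetical names */
enum FilterMode  { FILTER_POINT, FILTER_LINEAR };

/* Largest integer not greater than x, without needing libm. */
int floor_to_int(float x) {
    int i = (int)x;
    return (x < (float)i) ? i - 1 : i;
}

/* Clamp: out-of-range indices stick to the nearest edge texel. */
int clamp_index(int i, int n) {
    return i < 0 ? 0 : (i >= n ? n - 1 : i);
}

/* Wrap: the texture repeats periodically. */
int wrap_index(int i, int n) {
    int m = i % n;
    return m < 0 ? m + n : m;
}

float texel(const float *t, int n, int i, enum AddressMode am) {
    return t[am == ADDR_WRAP ? wrap_index(i, n) : clamp_index(i, n)];
}

/* Fetch from a 1D texture of n texels at normalized coordinate u,
   where [0, 1) spans the whole texture. */
float fetch1d(const float *t, int n, float u,
              enum AddressMode am, enum FilterMode fm) {
    assert(t != 0 && n > 0);
    float x = u * (float)n;               /* unnormalized coordinate */
    if (fm == FILTER_POINT)
        return texel(t, n, floor_to_int(x), am);
    /* Linear: blend the 2 texels whose centers bracket x. */
    float xb = x - 0.5f;
    int i = floor_to_int(xb);
    float a = xb - (float)i;              /* weight of the right-hand texel */
    return (1.0f - a) * texel(t, n, i, am) + a * texel(t, n, i + 1, am);
}
```

For a texture {10, 20, 30, 40}, a linear fetch at u = 0 in this model returns 10 with clamping (both neighbors are the edge texel) but 25 with wrapping (the left neighbor is the last texel).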
Texture Management
– channelDesc: texel type
    struct cudaChannelFormatDesc {
        int x, y, z, w;
        enum cudaChannelFormatKind f;
    };
    x,y,z,w: #bits per component
    f: cudaChannelFormatKindSigned, cudaChannelFormatKindUnsigned, cudaChannelFormatKindFloat
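To make the bits-per-component idea concrete, here is a small plain-C sketch that mirrors the descriptor layout with a hypothetical `ChannelDesc` struct (not the CUDA type) and derives the component count and texel size from it. It assumes a width of 0 means the component is absent.

```c
#include <assert.h>

enum ChannelKind { KIND_SIGNED, KIND_UNSIGNED, KIND_FLOAT };  /* hypothetical */

struct ChannelDesc {
    int x, y, z, w;          /* bits per component; 0 assumed to mean absent */
    enum ChannelKind f;      /* how the bits of each component are interpreted */
};

/* Number of components that are actually present. */
int component_count(struct ChannelDesc d) {
    return (d.x > 0) + (d.y > 0) + (d.z > 0) + (d.w > 0);
}

/* Size of one texel in bytes. */
int texel_bytes(struct ChannelDesc d) {
    int bits = d.x + d.y + d.z + d.w;
    assert(bits % 8 == 0);
    return bits / 8;
}
```

In this model a single float texel is described as {32, 0, 0, 0, KIND_FLOAT} and a four-component byte texel as {8, 8, 8, 8, KIND_UNSIGNED}; both occupy 4 bytes.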
Texture Management
– Run-time API: texture type derived from struct textureReference
– normalized, filterMode and addressMode apply only to texture references bound to CUDA arrays
Texture Management
– Binding a texture reference to a texture
    Runtime API:
        – Linear memory: cudaBindTexture
        – CUDA Array: cudaBindTextureToArray
    Driver API:
        – Linear memory: cuTexRefSetAddress
        – CUDA Array: cuTexRefSetArray
Runtime API: debugging using the emulation mode
– No native debug support for device code
– Code should be compiled either for device emulation OR execution: mixing not allowed
– Device code is compiled for the host
Runtime API: debugging using the emulation mode
– Features
    Each CUDA thread is mapped to a host thread, plus one master thread
    Each thread gets 256 KB of stack
Runtime API: debugging using the emulation mode
– Advantages
    Can use host debuggers
    Can use otherwise disallowed functions in device code, e.g. printf
    Device and host memory are both readable from either device or host
    Any device or host specific function can be called from either device or host code
    Runtime detects incorrect use of synch functions
Runtime API: debugging using the emulation mode
– Some errors may still remain hidden
    Memory access errors
    Out of context pointer operations
    Incorrect outcome of warp vote functions as warp size is 1 in emulation mode
    Result of FP operations often different on host and device
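The warp-vote pitfall can be reproduced with a host-side model. The sketch below is plain C, not a CUDA intrinsic: `vote_all` is a hypothetical function that computes, for each group of `warp_size` consecutive threads, whether all predicates in that group are nonzero. A warp size of 32 stands for the device and 1 for emulation mode.

```c
#include <assert.h>

/* Model of an "all" vote: every thread in a warp receives 1 only if
   the predicate is nonzero for all threads of that warp. */
void vote_all(const int *pred, int *out, int n, int warp_size) {
    assert(warp_size > 0);
    for (int start = 0; start < n; start += warp_size) {
        int end = start + warp_size < n ? start + warp_size : n;
        int all = 1;
        for (int i = start; i < end; ++i) all = all && pred[i];
        for (int i = start; i < end; ++i) out[i] = all;
    }
}
```

If only thread 5 of 32 has a false predicate, the warp-size-32 model returns 0 to every thread, while the warp-size-1 model returns 1 to all threads except thread 5. Code that branches on the vote therefore behaves differently in the two settings.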
Driver API: Context management
– A context encapsulates all resources and actions performed within the driver API
– Almost all CUDA functions operate in a context, except those dealing with
    Device enumeration
    Context management
Driver API: Context management
– Each host thread can have only one current device context at a time
– Each host thread maintains a stack of current contexts
– cuCtxCreate()
    Creates a context
    Pushes it to the top of the stack
    Makes it the current context
Driver API: Context management
– cuCtxPopCurrent()
    Detaches the current context from the host thread, making it “uncurrent”
    The context is now floating
    It can be pushed to any host thread's stack
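The per-thread context stack can be pictured with a small data-structure model. This is a plain-C sketch with hypothetical types (`Context`, `HostThread`) and a hypothetical depth limit; it imitates the push and pop behaviour described above and does not call the driver API.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 8                  /* hypothetical limit for this sketch */

struct Context { int id; };

struct HostThread {
    struct Context *stack[MAX_DEPTH];
    int depth;
};

/* The current context is the top of the stack, or NULL if there is none. */
struct Context *current(const struct HostThread *t) {
    return t->depth > 0 ? t->stack[t->depth - 1] : NULL;
}

/* Push a context; it becomes the thread's current context. */
void push_current(struct HostThread *t, struct Context *c) {
    assert(t->depth < MAX_DEPTH);
    t->stack[t->depth++] = c;
}

/* Pop the current context; the returned context is now floating. */
struct Context *pop_current(struct HostThread *t) {
    assert(t->depth > 0);
    return t->stack[--t->depth];
}
```

A context popped from one `HostThread` can be handed to `push_current` on another, which models a floating context migrating between host threads.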
Driver API: Context management
– Each context has a usage count
    cuCtxCreate creates a context with a usage count of 1
    cuCtxAttach increments the usage count
    cuCtxDetach decrements the usage count
Driver API: Context management
– A context is destroyed when its usage count reaches 0.
    cuCtxDetach, cuCtxDestroy
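The usage count behaves like ordinary reference counting. The sketch below models it in plain C with a hypothetical `RefContext` struct and `ctx_*` helpers; it only tracks the count and an `alive` flag and makes no claim about how the driver releases resources.

```c
#include <assert.h>

struct RefContext {
    int usage;   /* number of users attached to the context */
    int alive;   /* 1 while the context exists, 0 once destroyed */
};

/* Creation starts the usage count at 1. */
void ctx_create(struct RefContext *c) { c->usage = 1; c->alive = 1; }

/* Attaching adds one user. */
void ctx_attach(struct RefContext *c) {
    assert(c->alive);
    c->usage++;
}

/* Detaching removes one user; the last detach destroys the context. */
void ctx_detach(struct RefContext *c) {
    assert(c->alive && c->usage > 0);
    if (--c->usage == 0) c->alive = 0;
}
```

After create, attach, detach the model context is still alive with a count of 1; one more detach brings the count to 0 and marks it destroyed.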
Driver API: Module management
– Modules are dynamically loadable packages of device code and data output by nvcc
    Similar to DLLs
Driver API: Module management
– Dynamically loading a module and accessing its contents
    CUmodule cuModule;
    cuModuleLoad(&cuModule, "myModule.cubin");
    CUfunction cuFunction;
    cuModuleGetFunction(&cuFunction, cuModule, "myKernel");
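The lookup-by-name step is the part that resembles a DLL. As a rough host-side analogy in plain C, the sketch below treats a module as a table of named function pointers; `Module`, `Symbol`, `KernelFn`, `module_get_function` and `my_kernel` are all hypothetical and say nothing about the cubin format or the driver's internals.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*KernelFn)(float *data, int n);

struct Symbol { const char *name; KernelFn fn; };
struct Module { const struct Symbol *symbols; size_t count; };

/* Look up a function by name; returns NULL if the module has no such symbol. */
KernelFn module_get_function(const struct Module *m, const char *name) {
    assert(m != NULL && name != NULL);
    for (size_t i = 0; i < m->count; ++i)
        if (strcmp(m->symbols[i].name, name) == 0)
            return m->symbols[i].fn;
    return NULL;
}

/* Stand-in for device code: doubles every element. */
void my_kernel(float *data, int n) {
    for (int i = 0; i < n; ++i) data[i] *= 2.0f;
}
```

The caller holds only an opaque handle and a string name, and a misspelled name shows up as a failed lookup at run time rather than as a link error.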
Driver API: Execution control
– Set kernel parameters
    cuFuncSetBlockShape()
        – #threads/block for the function
        – How thread IDs are assigned
    cuFuncSetSharedSize()
        – Size of shared memory
    cuParam*()
        – Specify other parameters for next kernel launch
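How a block shape determines thread IDs can be written out directly. The plain-C sketch below uses a hypothetical `Dim3` struct and assumes the usual CUDA convention that the x index varies fastest, then y, then z.

```c
#include <assert.h>

struct Dim3 { int x, y, z; };   /* hypothetical stand-in for a block shape */

/* Number of threads in one block of the given shape. */
int threads_per_block(struct Dim3 shape) {
    return shape.x * shape.y * shape.z;
}

/* Linear thread ID of index idx within a block: x fastest, then y, then z. */
int thread_id(struct Dim3 shape, struct Dim3 idx) {
    assert(idx.x >= 0 && idx.x < shape.x);
    assert(idx.y >= 0 && idx.y < shape.y);
    assert(idx.z >= 0 && idx.z < shape.z);
    return idx.x + idx.y * shape.x + idx.z * shape.x * shape.y;
}
```

For a 16 x 8 x 2 block, index (3, 2, 1) maps to ID 3 + 2*16 + 1*128 = 163, and the last thread (15, 7, 1) maps to 255.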
Driver API: Execution control
– Launch kernel
    cuLaunch(), cuLaunchGrid()
– Example 4.5.3.5 in Prog Guide
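Before a launch, kernel arguments are placed one after another in a parameter area, each at an offset that respects its alignment. The plain-C sketch below models only that offset bookkeeping; `ParamBuffer`, `align_up` and `param_push` are hypothetical helpers, and the 256-byte capacity is an arbitrary choice for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct ParamBuffer {
    unsigned char bytes[256];    /* hypothetical capacity for this sketch */
    size_t size;                 /* bytes used so far, including padding */
};

/* Round offset up to a multiple of alignment (which must be a power of two). */
size_t align_up(size_t offset, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

/* Append one argument at its aligned offset; returns that offset. */
size_t param_push(struct ParamBuffer *p, const void *value,
                  size_t size, size_t alignment) {
    size_t offset = align_up(p->size, alignment);
    assert(offset + size <= sizeof p->bytes);
    memcpy(p->bytes + offset, value, size);
    p->size = offset + size;
    return offset;
}
```

Pushing a 1-byte flag, a 4-byte int and an 8-byte double in that order places them at offsets 0, 4 and 8, for a total parameter size of 16 bytes. The padding between the flag and the int is the part that is easy to forget when offsets are computed by hand.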
Final Projects
Ideas?
– DES cracker
– Image editor
    Resize and smooth an image
    Gamut mapping?
– 3D Shape matching
All for today
Next time
– Memory and Instruction optimizations