9/25/2011
Memory Management
Bedřich Beneš, Ph.D.
Purdue University, Department of Computer Graphics
© Bedrich Benes
Memory Access Bandwidth
• Host and device have different memory spaces
• How fast is the access? (2009: CPU Intel Core i7, GPU GT200)
CPU – main memory: approx. 20 GB/s
GPU – host memory over the bus: 2x 4 GB/s (read/write)
GPU – GDRAM: approx. 150 GB/s
• GTX 560: 130 GB/s, 1/2 GB
• GTX 580: 190 GB/s, 1.5 GB
Memory Spaces
• Host manages its own memory and some device memory
• Device manages its own memory
• Host manages data copies between host and device, and device-to-device
© Image courtesy of NVIDIA
Memory Spaces
• Main memory ⇔ L3 cache: ~200 cycles, 20-30 GB/s
• L3 cache ⇔ L1/L2 cache: 25-35 cycles
• L1/L2 cache ⇔ registers: 5-12 cycles
Memory Spaces
1) Global Memory (R/W)
• Slow; can be ~150x slower than shared memory.
• Can be accessed by all threads.
• Accessible from device and host.
• Lives with the application.
Memory Spaces
2) Constant Memory (R)
• Fast read when all threads access the same location.
• Can be accessed by all threads.
• Accessible from device and host.
• Lives with the application.
Memory Spaces
3) Shared Memory (R/W)
• On-chip. Very fast.
• Allocated per thread block.
• As fast as registers if there are no bank conflicts (or all threads read the same address).
• Accessible by ANY thread within the block; dies with the block.
• Accessible from device.
Memory Spaces
4) Registers (R/W)
• On-chip. Very fast.
• Allocated to a thread and die with it.
• Accessible from device.
Memory Spaces
5) Local Memory (R/W)
• Can be ~150x slower than shared memory.
• Accessible by one thread and dies with it.
• Accessible from device.
Memory Spaces
6) Texture Memory (R)
• Can be ~150x slower than shared memory.
• Cached on chip, so it can be fast.
• Accessible to all threads.
• Lives with the application.
• Has special look-up functions.
• Accessible from device and host.
Memory Spaces
Shared memory vs. global/local memory:
GPU memory access command: 4 clock cycles
local/global memory access: 400-600 cycles
GPU shared memory access: 4 clock cycles
Shared memory access is approx. 100-150x faster!
Memory Spaces
Local variables live in registers by default.
If too many local resources are used, the compiler may place a variable into local memory.
Memory Spaces
• How do I know where my variable lives?
• Compile with the -ptx or -keep flag and inspect the generated assembly (PTX) code
Memory Spaces
• the generated PTX assembly code:
.reg .u16 %rh<6>;  //16-bit unsigned int registers
.reg .u32 %r<29>;  //32-bit unsigned int registers
.reg .f32 %f<24>;  //32-bit float registers
.loc 2 222 0       //source location directive
Device and Host Pointers
• We will follow a simple rule of thumb:
• a variable starting with d points to device memory,
  h points to host memory
float *dPtr; //pointer to device
float *hPtr; //pointer to host
Device and Host Pointers
Both pointers live in the host memory, but they point to different spaces: *hPtr into host memory, *dPtr into device memory.
Device and Host Pointers
Host pointers are accessed and manipulated by standard C/C++ constructs: malloc, free, new, delete.
Device pointers cannot be used in the same way; they need special functions.
GPU Linear Memory 1D
Example:
int n=128;
int size=n*sizeof(float);
float *dA;
cudaMalloc((void **)&dA,size);
cudaMemset(dA,0,size);
cudaFree(dA);

cudaMalloc(void **ptr, size_t n)
cudaMemset(void *ptr, int val, size_t n)
cudaFree(void *ptr)
Data Copy Linear Memory 1D
• enum cudaMemcpyKind:
• cudaMemcpyHostToDevice
• cudaMemcpyDeviceToHost
• cudaMemcpyDeviceToDevice

cudaMemcpy(void *dst, void *src, size_t n,
  enum cudaMemcpyKind direction)
Data Copy Linear Memory 1D
• Does NOT start until all preceding CUDA calls complete (synchronous)
• Does NOT let the CPU work while copying (blocks the CPU thread)
• It is a "safe call"
• Note: asynchronous calls exist in CUDA
Data Copy 1D Example
float *h_A,*h_B,*h_C; //host ptrs
float *d_A,*d_B,*d_C; //device ptrs
int N = 50000;
size_t size = N * sizeof(float);
// Allocate input vectors h_A and h_B in host memory
h_A = (float*)malloc(size);
h_B = (float*)malloc(size);
h_C = (float*)malloc(size);
for (int i=0;i<N;i++) {
  h_A[i]=(float)i/N;
  h_B[i]=1-(float)i/N;
}
Data Copy 1D Example
// Allocate vectors in device memory
cudaMalloc((void**)&d_A,size);
cudaMalloc((void**)&d_B,size);
cudaMalloc((void**)&d_C,size);
// Copy vectors from host memory to device memory
cudaMemcpy(d_A,h_A,size,cudaMemcpyHostToDevice);
cudaMemcpy(d_B,h_B,size,cudaMemcpyHostToDevice);
//kernel would be executed here
cudaMemcpy(d_B,d_A,size,cudaMemcpyDeviceToDevice);
// Copy result from device memory to host memory
cudaMemcpy(h_C,d_C,size,cudaMemcpyDeviceToHost);
GPU Linear Memory 2D
• Used for 2D arrays of width x height
• GPU performs better when the data is correctly aligned (on multiples of 2^n)
• "pitch" is the allocated width of each row in bytes (can be bigger than the requested width)
• The allocation is padded for good performance

cudaMallocPitch(void **ptr, size_t *pitch,
  size_t width, size_t height)
GPU Linear Memory 2D
• Having an array of 12 floats per row, CUDA may pad it to pitch = 16 floats:
an array row: [ 1 2 3 4 5 6 7 8 9 10 11 12 | padding ]
• you have to deal with this while using it
GPU Linear Memory 2D
[figure: a 2D array of rows x columns, with each row padded out to the pitch]
GPU Linear Memory 2D
2D Memory allocation:
const int w=500, h=500; //width and height are the same
float *dPtr;
size_t pitch; //size_t is important
error=cudaMallocPitch((void**)&dPtr,&pitch,
  w*sizeof(float),h); //width is in bytes
//check the error
Kernel<<<100,512>>>(dPtr,pitch,w,h);
Kernel2<<<100,512>>>(dPtr,pitch);
GPU Linear Memory 2D
2D Memory Access
row and column are known (from the kernel indices), pitch is known from the allocation, and the element is of type T. Its location is:
T* pElement
  =(T*)((char*)baseAddress+row*pitch)+column;
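The same address arithmetic can be checked on the host in plain C. A small sketch, assuming a hypothetical 12-float logical row padded to a 64-byte pitch (16 floats); the buffer simply stands in for the pitched device allocation:

```c
#include <assert.h>
#include <string.h>

/* Same arithmetic as on the slide:
   (T*)((char*)baseAddress + row*pitch) + column */
float *element(void *base, size_t pitch, int row, int col) {
    return (float *)((char *)base + row * pitch) + col;
}
```

Note that the row offset is computed in bytes (hence the char* cast) while the column offset is computed in elements; mixing the two up is the classic pitched-memory bug.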
GPU Linear Memory 2D
//kernel with nested cycles
__global__ void Kernel(float *dPtr, int pitch, int w, int h) {
  for (int r=0;r<h;r++) {
    float *row=(float*)((char*)dPtr+r*pitch);
    for (int c=0;c<w;c++) {
      float element=row[c];
      //do some operation
    }//of for c
  }//of for r
}//of Kernel
GPU Linear Memory 2D
//kernel with implicit indexing
__global__ void Kernel2(float *dPtr, int pitch) {
  int i=blockDim.x*blockIdx.x+threadIdx.x;
  int j=blockDim.y*blockIdx.y+threadIdx.y;
  if ((i>=N)||(j>=N)) return; //N is the (square) array dimension
  float *elm=(float*)((char*)dPtr+j*pitch)+i;
  *elm=0.5f; //sets the value in the 2D array
}//of Kernel2
Data Copy Linear Memory 2D
• enum cudaMemcpyKind:
• cudaMemcpyHostToDevice
• cudaMemcpyDeviceToHost
• cudaMemcpyDeviceToDevice

cudaMemcpy2D(void *dst, size_t dpitch,
  void *src, size_t spitch,
  size_t w, size_t h,
  enum cudaMemcpyKind direction)
Data Copy Linear Memory 2D
• uses two pitch values, one for the source and one for the destination
• in the host memory, the pitch is usually the size of the row (in bytes)
Data Copy 2D Example
const int MAX=500; //will need to be pitched
float a[MAX][MAX];
float *dPtr;
size_t pitch;
int maxBytes=MAX*sizeof(float);
error=cudaMallocPitch((void**)&dPtr,&pitch,
  maxBytes,MAX);
error=cudaMemcpy2D(dPtr,pitch,
  a,maxBytes,
  maxBytes,MAX,cudaMemcpyHostToDevice);
Kernel2<<<100,200>>>(dPtr,pitch);
error=cudaMemcpy2D(a,maxBytes,
  dPtr,pitch,
  maxBytes,MAX,cudaMemcpyDeviceToHost);
GPU Linear Memory 3D

cudaMemcpy3D(const struct cudaMemcpy3DParms *p)

cudaMalloc3D(struct cudaPitchedPtr *pitchedDevPtr,
  struct cudaExtent extent)
Constant Memory
• Similar to global variables. Read only.
• 64 kB only, but very useful.
• Defined with global scope within the kernel file using __constant__
• Initialized by the host via cudaMemcpyToSymbol, cudaMemcpyFromSymbol
Constant Variables
• const float PI=3.14159;
• Will be in registers, as long as there is enough space.
• Will not be in the constant memory.
Page-locked (Pinned) Memory
• is on the host
• can be read by the GPU directly and processed concurrently with the kernel execution
• useful for single reads
Page-locked (Pinned) Memory
• attribs can be:
• cudaHostAllocWriteCombined
• cudaHostAllocMapped
• cudaHostAllocPortable

cudaHostAlloc(void **pH, size_t n, unsigned int attribs)
cudaFreeHost(void *pH)
Portable Memory
• page-locked memory has its benefits only for the host thread that created it
• by making it portable, the memory is available to all host threads
Write-Combining Memory
• page-locked memory uses the L1 and L2 caches by default
• cudaHostAllocWriteCombined makes it write-combined
• does not use the cache (more cache for other things)
• not snooped during PCIe transfers (up to 40% faster)
• reading it from the host is slow; it should be used for host writes only
Mapped Memory
• some devices can map page-locked memory into the device address space → no need for explicit copies between host and device!
• the same page has two pointers: one for the host and one for the device
• multiple GPUs can access the same page
Mapped Memory
• takes the host pointer pH (obtained from cudaHostAlloc()) and maps it to the device-space pointer pD
• flags is unused for now

cudaHostGetDevicePointer(void **pD,
  void *pHost, unsigned int flags)
Mapped Memory
#if CUDART_VERSION < 2020
#error "No support for mapped memory!\n"
#endif
//Check if device 0 supports mapped memory
cudaDeviceProp devProp;
cudaGetDeviceProperties(&devProp,0);
if (!devProp.canMapHostMemory) {
  printf("Device cannot map host memory!\n");
  exit(EXIT_FAILURE);
}
Mapped Memory
size_t size=1024*sizeof(float);
float *aH,*aD;
cudaHostAlloc((void **)&aH,size,cudaHostAllocMapped);
//Get the device pointer to the mapped memory
cudaHostGetDevicePointer((void **)&aD,(void *)aH,0);
Page-locked (Pinned) Memory
• Speedup? Vector addition on a Quadro FX 770M, 10,000x reading and writing the results
Texture Memory
• can be faster than the global memory
• is global memory with cached access
• (Fermi caches global memory as well)
• the cache is optimized for 2D spatial locality
• designed for streaming fetches
• read by the kernel using texture fetches
Texture Memory
• a texture reference is an object
• a texture must be bound and has attributes
• a texture can be linear memory or a CUDA array
• a texture can be shared with OpenGL
Texture Declaration
• type: float, basic integer types
• dim: 1, 2, 3
• readMode: cudaReadModeNormalizedFloat
  ranges: [0,1] or [-1,1]
  cudaReadModeElementType
  ranges: 0…0xFF

texture<type,dim,readMode> texRef;
Texture Declaration
• NTC (normalized texture coordinates) are in range [0,1]
• using floating-point textures allows for
  • wrapping
  • filtering
• using integer textures, out-of-range values are clamped
Texture Binding
offset – returned because of alignment
texref – the texture to bind
devPtr – memory address on the device
desc – channel format
size – size of the memory

cudaBindTexture(size_t *offset,
  const struct textureReference *texref,
  const void *devPtr,
  const struct cudaChannelFormatDesc *desc,
  size_t size)
Texture Binding
struct textureReference {
  int normalized;
  enum cudaTextureFilterMode filterMode;
  enum cudaTextureAddressMode addressMode[3];
  struct cudaChannelFormatDesc channelDesc;
};
normalized:
~ if 0, coordinates are [0,…,width-1] x [0,…,height-1] x [0,…,depth-1]
~ if 1, coordinates are [0,1]^3
Texture Binding
filter mode: specifies the filtering mode
cudaFilterModePoint – nearest-neighbor sampling
cudaFilterModeLinear – (bi/tri)linear interpolation
(valid only for floating-point types)
Texture Binding
address mode: defines what happens to out-of-range coordinates
cudaAddressModeClamp – clamped to the valid range
cudaAddressModeWrap – wrapped into the valid range
(wrap is valid only for normalized coordinates)
Texture Binding
channel description:
struct cudaChannelFormatDesc {
  int x,y,z,w; // # of bits per component
  enum cudaChannelFormatKind f;
};
cudaChannelFormatKind is one of:
  cudaChannelFormatKindSigned
  cudaChannelFormatKindUnsigned
  cudaChannelFormatKindFloat
Texture Binding Example
cudaChannelFormatDesc channelDesc=
  cudaCreateChannelDesc(32,0,0,0,cudaChannelFormatKindFloat);
cudaArray* cu_array; //CUDA array
cudaMallocArray(&cu_array,&channelDesc,width,height);
cudaMemcpyToArray(cu_array,0,0,h_data,size,
  cudaMemcpyHostToDevice);
tex.addressMode[0]=cudaAddressModeWrap;
tex.addressMode[1]=cudaAddressModeWrap;
tex.filterMode=cudaFilterModeLinear;
tex.normalized=true;
cudaBindTextureToArray(tex,cu_array,channelDesc);
//there is no "input" to the kernel; it reads the texture
BlurKernel<<<dimGrid,dimBlock>>>(d_data,width,height);
Texture Binding Example
texture<float,2,cudaReadModeElementType> tex;
__global__ void BlurKernel(float* d_data,int w,int h) {
  unsigned int x=blockIdx.x*blockDim.x+threadIdx.x;
  unsigned int y=blockIdx.y*blockDim.y+threadIdx.y;
  //the texel itself, plus/minus one in both u and v
  float u=x/(float)w;
  float v=y/(float)h;
  float up=(x+1)/(float)w;
  float um=(x-1)/(float)w;
  float vp=(y+1)/(float)h;
  float vm=(y-1)/(float)h;
  //read from texture, sum all nine neighbors,
  //divide by nine, and write to global memory
  d_data[y*w+x]=(tex2D(tex,u,v)+
    tex2D(tex,up,v)+tex2D(tex,um,v)+
    tex2D(tex,u,vp)+tex2D(tex,u,vm)+
    tex2D(tex,up,vp)+tex2D(tex,up,vm)+
    tex2D(tex,um,vp)+tex2D(tex,um,vm))/9.f;
}
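The same nine-tap average can be written as a host-side C reference, which is handy for verifying kernel output. A sketch on a hypothetical 4x4 image, assuming wrap addressing in both directions (as set when the texture was bound):

```c
#include <assert.h>

#define W 4
#define H 4

/* Wrap an integer index into [0,n), as cudaAddressModeWrap
   does for normalized coordinates.                          */
int wrap_idx(int i, int n) { return ((i % n) + n) % n; }

/* Host-side reference (not the CUDA kernel) of the nine-tap
   box blur: each output pixel is the mean of the 3x3
   neighborhood around it.                                   */
void blur_ref(const float src[H][W], float dst[H][W]) {
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            float s = 0.f;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    s += src[wrap_idx(y + dy, H)][wrap_idx(x + dx, W)];
            dst[y][x] = s / 9.f;
        }
}
```

A constant image is a fixed point of the blur, which gives a cheap sanity check.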
Texture Memory
• pretty complicated setup
• can be faster than the global memory (it is cached)
• can use CUDA arrays
• good for large read-only inputs
Reading
• NVIDIA CUDA Programming Guide
• Kirk, D.B., Hwu, W.W., Programming Massively Parallel Processors, Morgan Kaufmann, 2010
• Sanders, J., Kandrot, E., CUDA by Example, Addison-Wesley, 2010