
Memory Management
Bedřich Beneš, Ph.D.
Purdue University
Department of Computer Graphics

© Bedrich Benes

Memory Access Bandwidth
• Host and device have different memory spaces
• How fast is the access? (2009 figures: Intel Core i7 CPU, GT200 GPU)
  CPU – main memory: approx. 20 GB/s
  GPU – host main memory (over PCIe): 2x 4 GB/s (read/write)
  GPU – GDRAM: approx. 150 GB/s
• GTX 560: 130 GB/s, 1/2 GB
• GTX 580: 190 GB/s, 1.5 GB


Memory Spaces
• Host manages its own memory and some device memory
• Device manages its own memory
• Host manages data copies between host and device, and device to device (d2d)

[Figure: host and device memory spaces. Image courtesy of NVIDIA]

Memory Spaces

[Figure: memory spaces. Image courtesy of NVIDIA]


Memory Spaces
• Main memory ⇔ L3 cache: 200 cycles, 20-30 GB/s
• L3 cache ⇔ L1/L2 cache: 25-35 cycles
• L1/L2 cache ⇔ registers: 5-12 cycles

[Figure: CPU cache hierarchy. Image courtesy of NVIDIA]

Memory Spaces
1) Global Memory (R/W)
• Slow: can be ~150x slower than shared memory.
• Can be accessed by all threads.
• Accessible from device and host.
• Lives with the application.
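A minimal sketch (mine, not from the slides) of the two common ways to obtain global memory: a buffer allocated with cudaMalloc and passed to the kernel, and a statically declared __device__ variable:

  __device__ float gGain;                      // statically allocated in global memory

  __global__ void Scale(float *dData, int n)   // dData: a cudaMalloc'ed buffer
  {
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i < n) dData[i] *= gGain;              // both accesses go to global memory
  }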


Memory Spaces
2) Constant Memory (R)
• Fast read when all threads access the same location.
• Can be accessed by all threads.
• Accessible from device and host.
• Lives with the application.


Memory Spaces
3) Shared Memory (R/W)
• On-chip, very fast.
• Allocated per thread block.
• As fast as a register if there are no bank conflicts (all threads reading the same location is a broadcast and is also fast).
• Accessible by any thread within the block; dies with the block.
• Accessible from device.
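A minimal sketch (mine, not from the slides), assuming a launch with 256 threads per block: each block stages its slice of global memory in shared memory, synchronizes, and reduces it:

  __global__ void SumBlock(const float *dIn, float *dOut)
  {
    __shared__ float s[256];             // one copy per thread block
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    s[threadIdx.x] = dIn[i];             // stage global -> shared
    __syncthreads();                     // wait until the whole block has written
    if (threadIdx.x == 0) {              // naive reduction by a single thread
      float sum = 0.f;
      for (int k = 0; k < blockDim.x; k++) sum += s[k];
      dOut[blockIdx.x] = sum;            // one partial sum per block
    }
  }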


Memory Spaces
4) Registers (R/W)
• On-chip, very fast.
• Allocated to a thread; dies with it.
• Accessible from device.


Memory Spaces
5) Local Memory (R/W)
• Can be ~150x slower than shared memory.
• Accessible by one thread; dies with it.
• Accessible from device.
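A sketch (mine) of what typically ends up in local memory: a per-thread array that is too large for registers and is indexed dynamically:

  __global__ void Spill(float *dOut, int n)
  {
    float scratch[512];                      // likely placed in (slow) local memory
    for (int k = 0; k < 512; k++) scratch[k] = k * 0.5f;
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i < n) dOut[i] = scratch[i % 512];   // dynamic indexing prevents register use
  }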


Memory Spaces
6) Texture Memory (R)
• Can be ~150x slower than shared memory.
• Cached on chip, so it can be fast.
• Accessible to all threads.
• Lives with the application.
• Has special functions for look-up.
• Accessible from device and host.


Memory Spaces

Shared memory vs. global/local memory:
• issuing a memory access instruction: 4 clock cycles
• local/global memory access latency: 400-600 cycles
• shared memory access: 4 clock cycles
• shared memory access is approx. 100-150x faster (400-600 / 4)!


Memory Spaces
Local variables are by default in registers.
If too many local resources are used, the compiler may place a variable into local memory.


Memory Spaces
• How do I know where my variable lives?
• Compile with the -ptx or -keep parameter and inspect the generated assembly code.
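For example (assuming the source file is named kernel.cu):

  nvcc -ptx kernel.cu     (writes kernel.ptx, the assembly to inspect)
  nvcc -keep kernel.cu    (compiles normally, but keeps all intermediate files)

In the resulting .ptx, register variables show up as .reg declarations and spilled variables live in the .local state space.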


Memory Spaces
• the assembly code:

  .reg .u16 %rh<6>;   // register, unsigned int16
  .reg .u32 %r<29>;   // register, unsigned int32
  .reg .f32 %f<24>;   // register, float32
  .loc 2 222 0        // source-line marker (local-memory variables appear as .local)


Device and Host Pointers
• We will follow a simple rule of thumb:
• starting a variable name with d indicates it points to device memory; h indicates it points to host memory

  float *dPtr; // pointer to device
  float *hPtr; // pointer to host


Device and Host Pointers
Both pointers live in the host memory, but they point to different spaces.
[Figure: hPtr points into host memory; dPtr points into device memory]


Device and Host Pointers
Host pointers are accessed/manipulated by standard C/C++ constructs: malloc, free, new, delete.
Device pointers cannot be used in the same way; they need special functions.


GPU Linear Memory 1D

Example:
  int n = 128;
  int size = n * sizeof(float);
  float *dA;
  cudaMalloc((void **)&dA, size);
  cudaMemset(dA, 0, size);
  cudaFree(dA);

  cudaMalloc(void **ptr, size_t n)
  cudaMemset(void *ptr, int val, size_t n)
  cudaFree(void *ptr)


Data Copy Linear Memory 1D

• enum cudaMemcpyKind
  • cudaMemcpyHostToDevice
  • cudaMemcpyDeviceToHost
  • cudaMemcpyDeviceToDevice

  cudaMemcpy(void *dst, const void *src, size_t n, enum cudaMemcpyKind direction)


Data Copy Linear Memory 1D
• Does NOT start until all previous CUDA calls complete (synchronous)
• Does NOT let the CPU work while copying (blocks the CPU thread)
• It is a "safe call"
• Note: asynchronous calls exist in CUDA (a sketch follows below)
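A sketch (mine, not the lecture's code) of the asynchronous variant, using the names d_A, h_A, and size from the example below; the host buffer must be page-locked for a truly asynchronous copy, and Kernel is a placeholder name:

  cudaStream_t stream;
  cudaStreamCreate(&stream);
  cudaMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice, stream);
  // the CPU is free to do other work here while the copy proceeds
  cudaStreamSynchronize(stream);  // block until the queued work is done
  cudaStreamDestroy(stream);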


Data Copy 1D Example
  float *h_A, *h_B, *h_C; // host ptrs
  float *d_A, *d_B, *d_C; // device ptrs
  int N = 50000;
  size_t size = N * sizeof(float);

  // Allocate input vectors h_A and h_B in host memory
  h_A = (float*)malloc(size);
  h_B = (float*)malloc(size);
  h_C = (float*)malloc(size);
  for (int i = 0; i < N; i++) {
    h_A[i] = (float)i/N;
    h_B[i] = 1 - (float)i/N;
  }


Data Copy 1D Example
  // Allocate vectors in device memory
  cudaMalloc((void**)&d_A, size);
  cudaMalloc((void**)&d_B, size);
  cudaMalloc((void**)&d_C, size);

  // Copy vectors from host memory to device memory
  cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
  // kernel would be executed here
  cudaMemcpy(d_B, d_A, size, cudaMemcpyDeviceToDevice);
  // Copy result from device memory to host memory
  cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);


GPU Linear Memory 2D
• Used for 2D arrays of width x height
• GPU performs better when the data is correctly aligned (on multiples of 2^n)
• the "pitch" is the number of bytes actually used per row (can be bigger than the width you asked for)
• the allocation is padded for good performance

  cudaMallocPitch(void **ptr, size_t *pitch, size_t width, size_t height)


GPU Linear Memory 2D
• Having an array with rows of 12 floats, CUDA may pad each row to pitch = 16 floats:

  an array row: [ 1 2 3 4 5 6 7 8 9 10 11 12 - - - - ]

• you have to deal with this padding while using the array


GPU Linear Memory 2D
[Figure: a 2D allocation; each of the rows spans the requested columns plus padding up to the pitch]


GPU Linear Memory 2D
2D Memory allocation:

  const int w = 500, h = 500; // width and height are the same
  float *dPtr;
  size_t pitch; // size_t is important
  cudaError_t error;
  error = cudaMallocPitch((void**)&dPtr, &pitch, w*sizeof(float), h);
  // check the error
  Kernel<<<100,512>>>(dPtr, pitch, w, h);
  Kernel2<<<100,512>>>(dPtr, pitch);


GPU Linear Memory 2D
2D Memory Access
row and column are known (from the kernel indices), pitch is known from the allocation, and the element is of type T. Its location is:

  T* pElement = (T*)((char*)baseAddress + row*pitch) + column;
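A worked instance (my numbers, not from the slides): with float elements (4 bytes), pitch = 64 bytes, row = 2, and column = 3, the element sits at baseAddress + 2*64 + 3*4 = baseAddress + 140 bytes. Note that row*pitch advances in bytes (hence the char* cast), while + column advances in whole elements.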


GPU Linear Memory 2D
  // kernel with nested cycles
  __global__ void Kernel(float *dPtr, int pitch, int w, int h)
  {
    for (int r = 0; r < h; r++) {
      float *row = (float*)((char*)dPtr + r*pitch);
      for (int c = 0; c < w; c++) {
        float element = row[c];
        // do some operation
      } // of for c
    } // of for r
  } // of Kernel


GPU Linear Memory 2D
  // kernel with implicit indexing
  // (N is the array dimension, assumed to be a global constant, e.g. 500)
  __global__ void Kernel2(float *dPtr, int pitch)
  {
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    int j = blockDim.y*blockIdx.y + threadIdx.y;
    if ((i >= N) || (j >= N)) return;
    float *elm = (float*)((char*)dPtr + j*pitch) + i;
    *elm = 0.5f; // sets the value in the 2D array
  } // of Kernel2


Data Copy Linear Memory 2D

• enum cudaMemcpyKind
  • cudaMemcpyHostToDevice
  • cudaMemcpyDeviceToHost
  • cudaMemcpyDeviceToDevice

  cudaMemcpy2D(void *dst, size_t dpitch, const void *src, size_t spitch,
               size_t w, size_t h, enum cudaMemcpyKind direction)


Data Copy Linear Memory 2D
• uses two pitch values, one for the source and one for the destination
• in the host memory, the pitch is usually the size of the row (in bytes)


Data Copy 2D Example
  const int MAX = 500; // will need to be pitched
  float a[MAX][MAX];
  float *dPtr;
  size_t pitch;
  int maxBytes = MAX*sizeof(float);
  cudaError_t error;
  error = cudaMallocPitch((void**)&dPtr, &pitch, maxBytes, MAX);
  error = cudaMemcpy2D(dPtr, pitch, a, maxBytes,
                       maxBytes, MAX, cudaMemcpyHostToDevice);
  Kernel2<<<100,200>>>(dPtr, pitch);
  error = cudaMemcpy2D(a, maxBytes, dPtr, pitch,
                       maxBytes, MAX, cudaMemcpyDeviceToHost);


GPU Linear Memory 3D

  cudaMalloc3D(struct cudaPitchedPtr *pitchedDevPtr, struct cudaExtent extent)

  cudaMemcpy3D(const struct cudaMemcpy3DParms *p)
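A minimal allocation sketch (mine), assuming a w x h x d volume of floats:

  cudaExtent ext = make_cudaExtent(w*sizeof(float), h, d); // width is in bytes
  cudaPitchedPtr dVol;
  cudaMalloc3D(&dVol, ext);     // dVol.pitch holds the padded row size in bytes
  cudaMemset3D(dVol, 0, ext);
  cudaFree(dVol.ptr);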


Constant Memory
• Similar to global variables. Read only.
• 64 kB only, but very useful.
• Defined with global scope within the kernel file using __constant__
• Initialized by the host via cudaMemcpyToSymbol, cudaMemcpyFromSymbol (see the sketch below)
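A minimal sketch (mine) of declaring and initializing a constant-memory array; cCoef is a hypothetical name:

  __constant__ float cCoef[16];  // lives in the 64 kB constant bank

  // host code:
  float hCoef[16] = {0};
  cudaMemcpyToSymbol(cCoef, hCoef, sizeof(hCoef));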



Constant Variables
• const float PI = 3.14159f;
• Will be in registers, as long as there is enough space.
• Will not be in the constant memory.
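The difference in one sketch (mine, with hypothetical names):

  const float PI = 3.14159f;  // ordinary constant: kept in registers/immediates
  __constant__ float dScale;  // constant memory: must be set from the host, e.g.
                              // cudaMemcpyToSymbol(dScale, &hScale, sizeof(float));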


Page-locked (Pinned) Memory
• is on the host
• can be read by the GPU directly and processed concurrently with the kernel execution
• useful for a single read


Page-locked (Pinned) Memory

  cudaHostAlloc(void **pHost, size_t n, unsigned int flags)
  cudaFreeHost(void *pHost)

• flags can be:
  • cudaHostAllocWriteCombined
  • cudaHostAllocMapped
  • cudaHostAllocPortable
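A usage sketch (mine):

  float *hBuf;
  cudaHostAlloc((void **)&hBuf, 1024*sizeof(float), cudaHostAllocDefault);
  // ... fill hBuf; copies to the device are faster from pinned memory ...
  cudaFreeHost(hBuf);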


Portable Memory
• page-locked memory has its benefits only for the host thread that created it
• by making it portable, the memory is available to all host threads


Write-Combining Memory
• page-locked memory uses the L1 and L2 caches
• cudaHostAllocWriteCombined makes it write-combined
• it does not use the cache (more cache for other things)
• it is not snooped during PCIe transfers (~40% faster)
• reading it from the host is slow; it should be used for host writes only

Mapped Memory
• some devices can map page-locked memory into the device address space → no need for explicit reads/writes between host and device!
• the same page has two pointers: one for the host and one for the device
• multiple GPUs can access the same page


Mapped Memory

  cudaHostGetDevicePointer(void **pD, void *pHost, unsigned int flags)

• takes the host pointer pHost (obtained from cudaHostAlloc() with cudaHostAllocMapped) and maps it to the device-space pointer pD
• flags is unused for now


Mapped Memory
  #if CUDART_VERSION < 2020
  #error "No support for mapped memory!\n"
  #endif
  // Check if device 0 supports mapped memory
  cudaDeviceProp devProp;
  cudaGetDeviceProperties(&devProp, 0);
  if (!devProp.canMapHostMemory) {
    printf("Device cannot map host memory!\n");
    exit(EXIT_FAILURE);
  }


Mapped Memory
  size_t size = 1024*sizeof(float);
  float *aH, *aD;

  cudaHostAlloc((void **)&aH, size, cudaHostAllocMapped);

  // Get the device pointer to the mapped memory
  cudaHostGetDevicePointer((void **)&aD, (void *)aH, 0);
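A usage sketch (mine): the kernel consumes the device alias aD directly; Kernel, blocks, and threads are placeholder names. Note that cudaSetDeviceFlags(cudaDeviceMapHost) must be called before the allocation:

  cudaSetDeviceFlags(cudaDeviceMapHost); // before allocating mapped memory
  // ... allocation and mapping as above ...
  Kernel<<<blocks,threads>>>(aD);        // no cudaMemcpy needed
  cudaDeviceSynchronize();               // results are now visible through aH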


Page-locked (Pinned) Memory
• Speedup? Vector addition on a Quadro FX 770M, reading and writing the results 10,000x.


Texture Memory
• can be faster than the global memory
• is global memory with cached access
• (Fermi caches global memory as well)
• the cache is optimized for 2D spatial locality
• designed for streaming fetches
• read by the kernel using texture fetches


Texture Memory
• a texture reference is an object
• a texture must be bound and has attributes
• a texture can be linear memory or a CUDA array
• a texture can be shared with OpenGL


Texture Declaration

  texture<type,dim,readMode> texRef;

• type: float, basic integer types
• dim: 1, 2, 3
• readMode:
  cudaReadModeNormalizedFloat: values map to [0,1] or [-1,1]
  cudaReadModeElementType: values stay in their native range, e.g. 0…0xFF


Texture Declaration
• NTC (normalized texture coordinates) are in the range [0,1]
• using floating point textures allows for
  • wrapping
  • filtering
• using integer textures: outside values are clamped


Texture Binding

  cudaBindTexture(size_t *offset,
                  const struct textureReference *texref,
                  const void *devPtr,
                  const struct cudaChannelFormatDesc *desc,
                  size_t size)

offset – returned because of alignment
texref – the texture to bind
devPtr – memory address on the device
desc – channel format
size – size of the memory


Texture Binding

  struct textureReference {
    int normalized;
    enum cudaTextureFilterMode filterMode;
    enum cudaTextureAddressMode addressMode[3];
    struct cudaChannelFormatDesc channelDesc;
  };

normalized:
  if 0, coordinates are [0,…,width-1] x [0,…,height-1] x [0,…,depth-1]
  if 1, coordinates are [0,1]^3


Texture Binding
filterMode: specifies the filtering mode
  cudaFilterModePoint – nearest-neighbor sampling
  cudaFilterModeLinear – (bi/tri)linear interpolation (valid only for floating point types)

Texture Binding
addressMode: defines what happens to out-of-range coordinates
  cudaAddressModeClamp – clamped to the valid range
  cudaAddressModeWrap – wrapped into the valid range (valid only for normalized coordinates)


Texture Binding
channel description:

  struct cudaChannelFormatDesc {
    int x, y, z, w;                // # of bits per component
    enum cudaChannelFormatKind f;
  };

cudaChannelFormatKind is one of:
  cudaChannelFormatKindSigned
  cudaChannelFormatKindUnsigned
  cudaChannelFormatKindFloat


Texture Binding Example
  cudaChannelFormatDesc channelDesc =
      cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
  cudaArray *cu_array; // CUDA array
  cudaMallocArray(&cu_array, &channelDesc, width, height);
  cudaMemcpyToArray(cu_array, 0, 0, h_data, size, cudaMemcpyHostToDevice);
  tex.addressMode[0] = cudaAddressModeWrap;
  tex.addressMode[1] = cudaAddressModeWrap;
  tex.filterMode = cudaFilterModeLinear;
  tex.normalized = true;
  cudaBindTextureToArray(tex, cu_array, channelDesc);
  // there is no "input" of the kernel; the input is the texture
  BlurKernel<<<dimGrid,dimBlock>>>(d_data, width, height);


Texture Binding Example
  texture<float,2,cudaReadModeElementType> tex;

  __global__ void BlurKernel(float *d_data, int w, int h)
  {
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    // the texel itself, plus/minus one in both u and v
    float u  = x/(float)w;
    float v  = y/(float)h;
    float up = (x+1)/(float)w;
    float um = (x-1)/(float)w;
    float vp = (y+1)/(float)h;
    float vm = (y-1)/(float)h;
    // read from the texture, sum all nine neighbors,
    // divide by nine, and write to global memory
    d_data[y*w + x] = (tex2D(tex,u,v) +
        tex2D(tex,up,v) + tex2D(tex,um,v) +
        tex2D(tex,u,vp) + tex2D(tex,u,vm) +
        tex2D(tex,up,vp) + tex2D(tex,up,vm) +
        tex2D(tex,um,vp) + tex2D(tex,um,vm)) / 9.f;
  }


Texture Memory
• pretty complicated setup
• can be faster than the global memory (it is cached)
• can use CUDA arrays
• good for large read-only inputs


Reading
• CUDA Programming Guide
• Kirk, D.B., Hwu, W.W., Programming Massively Parallel Processors, NVIDIA/Morgan Kaufmann, 2010
• Sanders, J., Kandrot, E., CUDA by Example, Addison-Wesley, 2010