9/25/2011
Memory Management
Bedřich Beneš, Ph.D.
Purdue University, Department of Computer Graphics
© Bedrich Benes
Memory Access Bandwidth
• Host and device have different memory spaces
• How fast is the access? (2009: CPU Intel Core i7, GPU GT200)
CPU – main memory: approx. 20 GB/s
GPU – host memory over the bus: 2x 4 GB/s (read/write)
GPU – GDRAM: approx. 150 GB/s
• GTX 560: 130 GB/s, 1/2 GB
• GTX 580: 190 GB/s, 1.5 GB
Memory Spaces
• Host manages its own memory and some device memory
• Device manages its own memory
• Host manages data copies between host and device, and device-to-device
© Image courtesy of NVIDIA
Memory Spaces
• Main memory ⇔ L3 cache: ~200 cycles, 20-30 GB/s
• L3 cache ⇔ L1/L2 cache: 25-35 cycles
• L1/L2 cache ⇔ registers: 5-12 cycles
Memory Spaces
1) Global Memory (R/W)
• Slow; can be ~150x slower than shared memory.
• Can be accessed by all threads.
• Accessible from device and host.
• Lives with the application.
Memory Spaces
2) Constant Memory (R)
• Fast read when all threads access the same location.
• Can be accessed by all threads.
• Accessible from device and host.
• Lives with the application.
Memory Spaces
3) Shared Memory (R/W)
• On-chip. Very fast.
• Allocated per thread block.
• As fast as registers if there are no bank conflicts (or all threads read the same address).
• Accessible by ANY thread within the block; dies with the block.
• Accessible from device.
Memory Spaces
4) Registers (R/W)
• On-chip. Very fast.
• Allocated to a thread and die with it.
• Accessible from device.
Memory Spaces
5) Local Memory (R/W)
• Can be ~150x slower than shared memory.
• Accessible by one thread and dies with it.
• Accessible from device.
Memory Spaces
6) Texture Memory (R)
• Can be ~150x slower than shared memory.
• Cached on chip, so it can be fast.
• Accessible to all threads.
• Lives with the application.
• Has special look-up functions.
• Accessible from device and host.
Memory Spaces
Shared memory vs. global/local memory:
GPU memory access command: 4 clock cycles
local/global memory access: 400-600 cycles
GPU shared memory access: 4 clock cycles
Shared memory access is approx. 100-150x faster!
Memory Spaces
Local variables live in registers by default.
If too many local resources are used, the compiler may place a variable into local memory.
Memory Spaces
• How do I know where my variable lives?
• Compile with the -ptx or -keep flag and inspect the generated assembly (PTX) code
Memory Spaces
• the generated PTX assembly code:
.reg .u16 %rh<6>;  //16-bit unsigned int registers
.reg .u32 %r<29>;  //32-bit unsigned int registers
.reg .f32 %f<24>;  //32-bit float registers
.loc 2 222 0       //source location directive
Device and Host Pointers
• We will follow a simple rule of thumb:
• a variable starting with d points to device memory,
  h points to host memory
float *dPtr; //pointer to device
float *hPtr; //pointer to host
Device and Host Pointers
Both pointers live in the host memory, but they point to different spaces: *hPtr into host memory, *dPtr into device memory.
Device and Host Pointers
Host pointers are accessed and manipulated by standard C/C++ constructs: malloc, free, new, delete.
Device pointers cannot be used in the same way; they need special functions.
GPU Linear Memory 1D
Example:
int n=128;
int size=n*sizeof(float);
float *dA;
cudaMalloc((void **)&dA,size);
cudaMemset(dA,0,size);
cudaFree(dA);

cudaMalloc(void **ptr, size_t n)
cudaMemset(void *ptr, int val, size_t n)
cudaFree(void *ptr)
Data Copy Linear Memory 1D
• enum cudaMemcpyKind:
• cudaMemcpyHostToDevice
• cudaMemcpyDeviceToHost
• cudaMemcpyDeviceToDevice

cudaMemcpy(void *dst, void *src, size_t n,
  enum cudaMemcpyKind direction)
Data Copy Linear Memory 1D
• Does NOT start until all preceding CUDA calls complete (synchronous)
• Does NOT let the CPU work while copying (blocks the CPU thread)
• It is a "safe call"
• Note: asynchronous calls exist in CUDA
Data Copy 1D Example
float *h_A,*h_B,*h_C; //host ptrs
float *d_A,*d_B,*d_C; //device ptrs
int N = 50000;
size_t size = N * sizeof(float);
// Allocate input vectors h_A and h_B in host memory
h_A = (float*)malloc(size);
h_B = (float*)malloc(size);
h_C = (float*)malloc(size);
for (int i=0;i<N;i++) {
  h_A[i]=(float)i/N;
  h_B[i]=1-(float)i/N;
}
Data Copy 1D Example
// Allocate vectors in device memory
cudaMalloc((void**)&d_A,size);
cudaMalloc((void**)&d_B,size);
cudaMalloc((void**)&d_C,size);
// Copy vectors from host memory to device memory
cudaMemcpy(d_A,h_A,size,cudaMemcpyHostToDevice);
cudaMemcpy(d_B,h_B,size,cudaMemcpyHostToDevice);
//kernel would be executed here
cudaMemcpy(d_B,d_A,size,cudaMemcpyDeviceToDevice);
// Copy result from device memory to host memory
cudaMemcpy(h_C,d_C,size,cudaMemcpyDeviceToHost);
GPU Linear Memory 2D
• Used for 2D arrays of width x height
• GPU performs better when the data is correctly aligned (on multiples of 2^n)
• "pitch" is the allocated width of each row in bytes (can be bigger than the requested width)
• The allocation is padded for good performance

cudaMallocPitch(void **ptr, size_t *pitch,
  size_t width, size_t height)
GPU Linear Memory 2D
• Having an array of 12 floats per row, CUDA may pad it to pitch = 16 floats:
an array row: [ 1 2 3 4 5 6 7 8 9 10 11 12 | padding ]
• you have to deal with this while using it
GPU Linear Memory 2D
[figure: a 2D array of rows x columns, with each row padded out to the pitch]
GPU Linear Memory 2D
2D Memory allocation:
const int w=500, h=500; //width and height are the same
float *dPtr;
size_t pitch; //size_t is important
error=cudaMallocPitch((void**)&dPtr,&pitch,
  w*sizeof(float),h); //width is in bytes
//check the error
Kernel<<<100,512>>>(dPtr,pitch,w,h);
Kernel2<<<100,512>>>(dPtr,pitch);
GPU Linear Memory 2D
2D Memory Access
row and column are known (from the kernel indices), pitch is known from the allocation, and the element is of type T. Its location is:
T* pElement
  =(T*)((char*)baseAddress+row*pitch)+column;
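The same address arithmetic can be checked on the host in plain C. A small sketch, assuming a hypothetical 12-float logical row padded to a 64-byte pitch (16 floats); the buffer simply stands in for the pitched device allocation:

```c
#include <assert.h>
#include <string.h>

/* Same arithmetic as on the slide:
   (T*)((char*)baseAddress + row*pitch) + column */
float *element(void *base, size_t pitch, int row, int col) {
    return (float *)((char *)base + row * pitch) + col;
}
```

Note that the row offset is computed in bytes (hence the char* cast) while the column offset is computed in elements; mixing the two up is the classic pitched-memory bug.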
GPU Linear Memory 2D
//kernel with nested cycles
__global__ void Kernel(float *dPtr, int pitch, int w, int h) {
  for (int r=0;r<h;r++) {
    float *row=(float*)((char*)dPtr+r*pitch);
    for (int c=0;c<w;c++) {
      float element=row[c];
      //do some operation
    }//of for c
  }//of for r
}//of Kernel
GPU Linear Memory 2D
//kernel with implicit indexing
__global__ void Kernel2(float *dPtr, int pitch) {
  int i=blockDim.x*blockIdx.x+threadIdx.x;
  int j=blockDim.y*blockIdx.y+threadIdx.y;
  if ((i>=N)||(j>=N)) return; //N is the (square) array dimension
  float *elm=(float*)((char*)dPtr+j*pitch)+i;
  *elm=0.5f; //sets the value in the 2D array
}//of Kernel2
Data Copy Linear Memory 2D
• enum cudaMemcpyKind:
• cudaMemcpyHostToDevice
• cudaMemcpyDeviceToHost
• cudaMemcpyDeviceToDevice

cudaMemcpy2D(void *dst, size_t dpitch,
  void *src, size_t spitch,
  size_t w, size_t h,
  enum cudaMemcpyKind direction)
Data Copy Linear Memory 2D
• uses two pitch values, one for the source and one for the destination
• in the host memory, the pitch is usually the size of the row (in bytes)
Data Copy 2D Example
const int MAX=500; //will need to be pitched
float a[MAX][MAX];
float *dPtr;
size_t pitch;
int maxBytes=MAX*sizeof(float);
error=cudaMallocPitch((void**)&dPtr,&pitch,
  maxBytes,MAX);
error=cudaMemcpy2D(dPtr,pitch,
  a,maxBytes,
  maxBytes,MAX,cudaMemcpyHostToDevice);
Kernel2<<<100,200>>>(dPtr,pitch);
error=cudaMemcpy2D(a,maxBytes,
  dPtr,pitch,
  maxBytes,MAX,cudaMemcpyDeviceToHost);
GPU Linear Memory 3D

cudaMemcpy3D(const struct cudaMemcpy3DParms *p)

cudaMalloc3D(struct cudaPitchedPtr *pitchedDevPtr,
  struct cudaExtent extent)
Constant Memory
• Similar to global variables. Read only.
• 64 kB only, but very useful.
• Defined with global scope within the kernel file using __constant__
• Initialized by the host via cudaMemcpyToSymbol, cudaMemcpyFromSymbol
Constant Variables
• const float PI=3.14159;
• Will be in registers, as long as there is enough space.
• Will not be in the constant memory.
Page-locked (Pinned) Memory
• is on the host
• can be read by the GPU directly and processed concurrently with the kernel execution
• useful for single reads
Page-locked (Pinned) Memory
• attribs can be:
• cudaHostAllocWriteCombined
• cudaHostAllocMapped
• cudaHostAllocPortable

cudaHostAlloc(void **pH, size_t n, unsigned int attribs)
cudaFreeHost(void *pH)
Portable Memory
• page-locked memory has its benefits only for the host thread that created it
• by making it portable, the memory is available to all host threads
Write-Combining Memory
• page-locked memory uses the L1 and L2 caches by default
• cudaHostAllocWriteCombined makes it write-combined
• does not use the cache (more cache for other things)
• not snooped during PCIe transfers (up to 40% faster)
• reading it from the host is slow; it should be used for host writes only
Mapped Memory
• some devices can map page-locked memory into the device address space → no need for explicit copies between host and device!
• the same page has two pointers: one for the host and one for the device
• multiple GPUs can access the same page
Mapped Memory
• takes the host pointer pH (obtained from cudaHostAlloc()) and maps it to the device-space pointer pD
• flags is unused for now

cudaHostGetDevicePointer(void **pD,
  void *pHost, unsigned int flags)
Mapped Memory
#if CUDART_VERSION < 2020
#error "No support for mapped memory!\n"
#endif
//Check if device 0 supports mapped memory
cudaDeviceProp devProp;
cudaGetDeviceProperties(&devProp,0);
if (!devProp.canMapHostMemory) {
  printf("Device cannot map host memory!\n");
  exit(EXIT_FAILURE);
}
Mapped Memory
size_t size=1024*sizeof(float);
float *aH,*aD;
cudaHostAlloc((void **)&aH,size,cudaHostAllocMapped);
//Get the device pointer to the mapped memory
cudaHostGetDevicePointer((void **)&aD,(void *)aH,0);
Page-locked (Pinned) Memory
• Speedup? Vector addition on a Quadro FX 770M, 10,000x reading and writing the results
Texture Memory
• can be faster than the global memory
• is global memory with cached access
• (Fermi caches global memory as well)
• the cache is optimized for 2D spatial locality
• designed for streaming fetches
• read by the kernel using texture fetches
Texture Memory
• a texture reference is an object
• a texture must be bound and has attributes
• a texture can be linear memory or a CUDA array
• a texture can be shared with OpenGL
Texture Declaration
• type: float, basic integer types
• dim: 1, 2, 3
• readMode: cudaReadModeNormalizedFloat
  ranges: [0,1] or [-1,1]
  cudaReadModeElementType
  ranges: 0…0xFF

texture<type,dim,readMode> texRef;
Texture Declaration
• NTC (normalized texture coordinates) are in range [0,1]
• using floating-point textures allows for
  • wrapping
  • filtering
• using integer textures, out-of-range values are clamped
Texture Binding
offset – returned because of alignment
texref – the texture to bind
devPtr – memory address on the device
desc – channel format
size – size of the memory

cudaBindTexture(size_t *offset,
  const struct textureReference *texref,
  const void *devPtr,
  const struct cudaChannelFormatDesc *desc,
  size_t size)
Texture Binding
struct textureReference {
  int normalized;
  enum cudaTextureFilterMode filterMode;
  enum cudaTextureAddressMode addressMode[3];
  struct cudaChannelFormatDesc channelDesc;
};
normalized:
~ if 0, coordinates are [0,…,width-1] x [0,…,height-1] x [0,…,depth-1]
~ if 1, coordinates are [0,1]^3
Texture Binding
filter mode: specifies the filtering mode
cudaFilterModePoint – nearest-neighbor sampling
cudaFilterModeLinear – (bi/tri)linear interpolation
(valid only for floating-point types)
Texture Binding
address mode: defines what happens to out-of-range coordinates
cudaAddressModeClamp – clamped to the valid range
cudaAddressModeWrap – wrapped into the valid range
(wrap is valid only for normalized coordinates)
Texture Binding
channel description:
struct cudaChannelFormatDesc {
  int x,y,z,w; // # of bits per component
  enum cudaChannelFormatKind f;
};
cudaChannelFormatKind is one of:
  cudaChannelFormatKindSigned
  cudaChannelFormatKindUnsigned
  cudaChannelFormatKindFloat
Texture Binding Example
cudaChannelFormatDesc channelDesc=
  cudaCreateChannelDesc(32,0,0,0,cudaChannelFormatKindFloat);
cudaArray* cu_array; //CUDA array
cudaMallocArray(&cu_array,&channelDesc,width,height);
cudaMemcpyToArray(cu_array,0,0,h_data,size,
  cudaMemcpyHostToDevice);
tex.addressMode[0]=cudaAddressModeWrap;
tex.addressMode[1]=cudaAddressModeWrap;
tex.filterMode=cudaFilterModeLinear;
tex.normalized=true;
cudaBindTextureToArray(tex,cu_array,channelDesc);
//there is no "input" to the kernel; it reads the texture
BlurKernel<<<dimGrid,dimBlock>>>(d_data,width,height);
Texture Binding Example
texture<float,2,cudaReadModeElementType> tex;
__global__ void BlurKernel(float* d_data,int w,int h) {
  unsigned int x=blockIdx.x*blockDim.x+threadIdx.x;
  unsigned int y=blockIdx.y*blockDim.y+threadIdx.y;
  //the texel itself, plus/minus one in both u and v
  float u=x/(float)w;
  float v=y/(float)h;
  float up=(x+1)/(float)w;
  float um=(x-1)/(float)w;
  float vp=(y+1)/(float)h;
  float vm=(y-1)/(float)h;
  //read from texture, sum all nine neighbors,
  //divide by nine, and write to global memory
  d_data[y*w+x]=(tex2D(tex,u,v)+
    tex2D(tex,up,v)+tex2D(tex,um,v)+
    tex2D(tex,u,vp)+tex2D(tex,u,vm)+
    tex2D(tex,up,vp)+tex2D(tex,up,vm)+
    tex2D(tex,um,vp)+tex2D(tex,um,vm))/9.f;
}
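The same nine-tap average can be written as a host-side C reference, which is handy for verifying kernel output. A sketch on a hypothetical 4x4 image, assuming wrap addressing in both directions (as set when the texture was bound):

```c
#include <assert.h>

#define W 4
#define H 4

/* Wrap an integer index into [0,n), as cudaAddressModeWrap
   does for normalized coordinates.                          */
int wrap_idx(int i, int n) { return ((i % n) + n) % n; }

/* Host-side reference (not the CUDA kernel) of the nine-tap
   box blur: each output pixel is the mean of the 3x3
   neighborhood around it.                                   */
void blur_ref(const float src[H][W], float dst[H][W]) {
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            float s = 0.f;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    s += src[wrap_idx(y + dy, H)][wrap_idx(x + dx, W)];
            dst[y][x] = s / 9.f;
        }
}
```

A constant image is a fixed point of the blur, which gives a cheap sanity check.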
Texture Memory
• pretty complicated setup
• can be faster than the global memory (it is cached)
• can use CUDA arrays
• good for large read-only inputs
Reading
• NVIDIA CUDA Programming Guide
• Kirk, D.B., Hwu, W.W., Programming Massively Parallel Processors, Morgan Kaufmann, 2010
• Sanders, J., Kandrot, E., CUDA by Example, Addison-Wesley, 2010