Internship Technical Report 2010
L2S, SUPELEC-CNRS-University Paris Sud XI
Technical Report:
Speed Up of 2-D Separable Convolution Algorithm on
NVIDIA Graphics Processor by Memory Transfer Optimization
Under the guidance of:
Nicolas Gac
Laboratory of Signal and Systems (L2S)
(UMR 08506, CNRS-SUPELEC-Univ Paris Sud)
By:
Aditya Gautam (Roll No. 200730003)
B.Tech (ECE), IIIT Hyderabad, INDIA
CONTENTS

1) Introduction
   1.1) Purpose
   1.2) Scope
   1.3) Computing Power and Memory Bandwidth
   1.4) Algorithm Complexity and Arithmetic Intensity
2) Optimization Technique and Asynchronous Transfer
   2.1) Pinned Memory
   2.2) Asynchronous Transfer
   2.3) Streams
3) Results
4) Conclusion
5) Future Work
6) Internship Experience
Internship Technical Report 2010
3 L2S,SUPELEC-CNRS-University Paris Sud XI.
1) Introduction:
1.1) Purpose:
My research internship focused on speeding up computationally expensive algorithms on the graphics processor through memory transfer optimization. The aim of the project was to study the contribution of memory transfer time on the GPU for algorithms of different complexity, and to find a way to reduce the memory transfer time between the CPU and the GPU. The algorithms studied are matrix addition, matrix multiplication and 2-D separable convolution. The major hurdle in implementing an algorithm on the graphics processor is slow memory transfer, because the bandwidth between the CPU and the GPU is low compared to the bandwidth between device memory and the GPU. The optimization technique used is multi-stream transfer of data in asynchronous mode, which overlaps data transfer with kernel execution. The effect of pinned memory on the bandwidth between the CPU and the GPU is also studied and presented. Accelerating the 2-D separable convolution method by optimizing its memory transfers is another aim of this internship.
1.2) Scope:
The proposed optimization technique can be used in the implementation of the computationally heavy iterative reconstruction algorithms for image restoration developed at Laboratory L2S, SUPELEC. The 3D/2D reconstruction algorithm used in CT tomography developed at L2S requires the transfer of 3-D volume data, which takes a lot of time and is thus a major bottleneck in the effective implementation of this algorithm on GPUs. This algorithm can therefore be sped up by the proposed technique, which reduces the memory transfer time between the CPU and the GPU.
1.3) Computing Power and Memory Bandwidth:
With the advancement of technology, the computing power of graphics processors keeps increasing, as shown in Fig 1.1.
Figure 1.1 Peak GigaFlop performance of the GPU and CPU (from NVIDIA's Programming Guide 3.0).
Figure 1.2 Floating-point operations per second and memory bandwidth for GPU and CPU (from NVIDIA's Programming Guide 3.0).
As shown in Figure 1.2, the floating-point operations per second achieved by the GPU are much higher than those of the CPU. The reason behind this huge difference is that the GPU is specialized for compute-intensive, highly parallel computation and is designed so that more transistors are devoted to data computation and arithmetic operations rather than to data caching, flow control or memory operations.
The NVIDIA GPU used here is the NVIDIA Tesla C1060, which has 4 GB of global memory and a bandwidth of 102 GB/s between device memory and the GPU. The size of shared memory is 16 KB per multiprocessor (the NVIDIA Tesla C1060 has 30 multiprocessors), shared by all the threads belonging to that multiprocessor. The latency of global memory is very high compared to shared memory, so execution is optimized by transferring data from global memory into shared memory and performing the operations there.
Because of the low bandwidth between the CPU and the GPU, the data transfer time is high and thus becomes a limiting factor in the overall implementation of an algorithm on GPUs.
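The scale of this imbalance can be estimated with a back-of-the-envelope calculation. In the sketch below, the device-memory bandwidth (102 GB/s) is the Tesla C1060 figure quoted above, while the host-to-device rate (5 GB/s, a typical pinned-memory rate over PCI Express 2.0 x16) is an assumed illustrative value, not a measurement from this report:

```python
def transfer_time_ms(n_bytes, bandwidth_gb_per_s):
    """Time to move n_bytes at a given bandwidth, in milliseconds."""
    return n_bytes / (bandwidth_gb_per_s * 1e9) * 1e3

# One 1024*1024 image of single-precision floats.
image_bytes = 1024 * 1024 * 4

host_to_device = transfer_time_ms(image_bytes, 5.0)    # assumed PCIe 2.0 rate
within_device = transfer_time_ms(image_bytes, 102.0)   # C1060 device bandwidth

# The same image takes roughly 20x longer to cross the PCIe bus than to
# move within device memory, so host<->device traffic dominates.
```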
1.4) Algorithm Complexity and Arithmetic Intensity:
Any algorithm can be categorized as computation-bound or memory-bound depending on its memory transfer and kernel execution times. If the data transfer time is high compared to the computation time, the algorithm is memory-bound; otherwise it is computation-bound.
Figure 1.3 Plot of data transfer time (%) against computation time (%) for different algorithms (image/matrix size: 1024*1024).
Three algorithms of different complexity are considered: matrix addition, matrix multiplication and 2-D separable convolution. For matrix addition, the kernel execution (computation) time (7%) is very small compared to the memory transfer time, so it is memory-bound and is located towards the X axis (data transfer). 2-D separable convolution with two kernel sizes (21 and 51) is considered and plotted. With kernel size 51, the computation time (50%) equals the memory transfer time (50%), so the algorithm is neither computation-bound nor memory-bound and lies on the line x = y through the origin. Matrix multiplication lies between addition and 2-D separable convolution and also falls in the memory-bound category. Algorithms located above the line x = y are computation-bound, and those below it are memory-bound.
The memory transfer optimization technique can be effectively applied to algorithms that are memory-bound, i.e. located below the line x = y. The extent to which data transfer can be overlapped with execution in asynchronous mode, or equivalently the possible reduction in data transfer time, can also be estimated from this plot: the farther an algorithm lies below the line x = y, the smaller the fraction of the transfer that can be hidden (assuming the algorithm can be implemented with multi-stream asynchronous transfer, discussed later).
Arithmetic Intensity is defined as the ratio of the number of arithmetic operation to the total
number of the memory transfer operation (Between CPU and GPU).
Arithmetic Intensity = (number of arithmetic operations) / (number of memory transfer operations)
GPUs are well suited to problems that can be expressed as data-parallel computations, i.e. the same program executed in parallel on many data elements. Effective utilization of the GPU is favoured by high arithmetic intensity. Therefore the arithmetic intensity of an algorithm is an important factor in determining the extent to which it can be effectively implemented on the GPU.
Matrix addition: A + B = C
Consider the addition of matrices A and B (both of size N*N) whose result is stored in matrix C. The memory transfer operations comprise the transfer of matrices A and B from host to device (2*N*N elements) and the transfer of matrix C from device to host (N*N elements), so the total number of memory transfer operations is 3*(N*N). Each element of the resultant matrix C is the sum of the corresponding elements of A and B, so the number of arithmetic operations is N*N.
So, Arithmetic Intensity = (N*N) / (3*N*N) = 1/3.
Matrix multiplication: A*B = C
For matrices A and B (size N*N), each element of the resultant matrix C is generated by taking the dot product of a row of A with the corresponding column of B. The data transfer operations comprise the transfer of A and B from host to device (2*N*N elements) and of the resultant matrix C from device to host (N*N elements), so the total number of memory transfer operations is 3*(N*N). However, to calculate one element, N multiply operations have to be performed, so the total number of operations for matrix C is N*(N*N).
Arithmetic Intensity = N*(N*N) / (3*N*N) = N/3.
Here the arithmetic intensity is directly proportional to the size of the matrix: the larger the size, the higher the arithmetic intensity.
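As a sanity check, the two ratios above can be reproduced by counting operations explicitly. The following Python sketch is illustrative only (the report's actual kernels are CUDA C); the function names are my own:

```python
from fractions import Fraction

def arithmetic_intensity(arith_ops, transfer_ops):
    """Ratio of arithmetic operations to CPU<->GPU transfer operations."""
    return Fraction(arith_ops, transfer_ops)

def matrix_add_intensity(n):
    transfers = 3 * n * n   # A, B host->device; C device->host
    arith = n * n           # one addition per element of C
    return arithmetic_intensity(arith, transfers)

def matrix_mul_intensity(n):
    transfers = 3 * n * n   # same memory traffic as addition
    arith = n * n * n       # N multiplications per element of C
    return arithmetic_intensity(arith, transfers)
```

Note that the addition ratio is independent of N, while the multiplication ratio grows linearly with N, as derived above.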
2-D Convolution:
Separable convolution: In 2-D separable convolution the kernel is decomposed into a horizontal and a vertical kernel. First the image is convolved with the horizontal kernel, and then the resultant image is convolved with the vertical kernel. For an image of size N*N and kernel size k, the memory transfer operations comprise the image plus the kernel from host (CPU) to device (GPU), and the resultant image from device to host. Neglecting the kernel size in comparison to the image size (N >> k), the number of memory transfer operations is 2*(N*N). The number of operations per pixel is 2*k (k for the horizontal pass and k for the vertical pass), so for the complete image it is 2*k*(N*N).
Arithmetic Intensity = 2*k*(N*N) / (2*N*N) = k, which is directly proportional to the length of the kernel.
Non-separable convolution: Here the image is convolved with a kernel of size k*k, so calculating one pixel requires k*k operations (multiplications). The data transfer operations remain the same as for separable convolution.
Arithmetic Intensity = k*k*(N*N) / (2*N*N) = k*k/2.
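The decomposition behind these two ratios can be checked numerically: filtering with a separable k*k kernel (the outer product of a vertical and a horizontal vector) in two 1-D passes of k multiplies each gives exactly the same output as the direct 2-D pass with k*k multiplies per pixel. A pure-Python sketch (the report's implementation is CUDA C; zero padding at the borders, kernel applied in correlation form):

```python
def outer(v, h):
    """Separable 2-D kernel as the outer product of two 1-D kernels."""
    return [[vi * hj for hj in h] for vi in v]

def conv2d(img, ker):
    """Direct 2-D pass with zero padding: k*k multiplies per pixel."""
    n, k = len(img), len(ker)
    r = k // 2
    out = [[0.0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy, xx = y + i - r, x + j - r
                    if 0 <= yy < n and 0 <= xx < n:
                        acc += img[yy][xx] * ker[i][j]
            out[y][x] = acc
    return out

def conv_separable(img, v, h):
    """Horizontal pass with h, then vertical pass with v: 2*k multiplies per pixel."""
    n, k = len(img), len(h)
    r = k // 2
    rows = [[sum(img[y][x + j - r] * h[j]
                 for j in range(k) if 0 <= x + j - r < n)
             for x in range(n)] for y in range(n)]
    return [[sum(rows[y + i - r][x] * v[i]
                 for i in range(k) if 0 <= y + i - r < n)
             for x in range(n)] for y in range(n)]
```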
2) Optimization Technique and Asynchronous Transfer:
The peak bandwidth between device (global) memory and the GPU is much higher (102 GB/s for the NVIDIA Tesla C1060) than the peak bandwidth between host memory and the device. This imbalance is a bottleneck in the implementation of algorithms on GPUs and thus needs to be optimized. The aim of memory transfer optimization is to reduce the data transfer time, which is done by overlapping kernel execution with data transfer between the host and the device.
2.1) Page-Locked Host Memory (Pinned Memory)
Page-locked (pinned) memory is a scarce resource that attains the highest bandwidth between the host and the device. On devices that support pinned memory allocation and asynchronous transfer, it also allows copies of data between the host and the device to run concurrently with kernel execution on the GPU.
2.2) Asynchronous transfer:
In general, data transfers between the host and the device are blocking transfers, in which control is returned to the host thread only after the data has been completely transferred to the device. An asynchronous transfer is a non-blocking transfer in which control is returned to the host thread immediately. Asynchronous transfer requires pinned memory allocation on the host and a stream ID.
2.3) Streams:
A stream is a sequence of operations that are performed in order on the device. One stream may execute its commands out of order with respect to other streams. This property is used to overlap the data transfer of one stream with the kernel execution of another. Every stream has a stream ID; by default the stream ID is zero.
Stream creation:
cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);
Asynchronous transfer allows a data transfer operation to be overlapped with kernel execution:
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream1);
kernel<<<grid, block, 0, stream2>>>(otherData_d);
Here the asynchronous memory transfer takes place in stream1 and the kernel execution in stream2, so the two overlap. However, if stream2 is replaced by stream1, there is no overlap and the kernel execution starts only after the data transfer from host to device has completed.
Figure 2.1 Timeline for copying data between Host and Device and kernel execution with one stream
(From NVIDIA’s CUDA Best Practices Guide 2.3)
Figure 2.2 Timeline for copying data and kernel execution with 4 streams for the same data as in Figure 2.1 (from NVIDIA's CUDA Best Practices Guide 2.3)
The total time taken for the same operations as in Fig 2.1 with four streams is less than with a single stream. This is because, in the multi-stream case (shown in Figure 2.2), while the kernel execution of one stream is taking place, the data transfer of another stream proceeds at the same time. This overlap reduces the total time: the bandwidth between the CPU and the GPU is utilized while the GPU is executing kernels. It can also be seen that there is no overlap between the data transfer and the kernel execution of any single stream.
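The timing behaviour in Figures 2.1 and 2.2 can be sketched with a simple pipeline model. It assumes one copy engine and one compute engine (as on the Tesla C1060): transfers are serialized with each other and kernels with each other, but a transfer in one stream may overlap a kernel in another. This is an illustrative model, not a measurement:

```python
def total_time(n_streams, t_transfer, t_kernel):
    """Model the total time when the work is split equally across n streams."""
    tm = t_transfer / n_streams   # per-chunk host->device transfer time
    tk = t_kernel / n_streams     # per-chunk kernel execution time
    copy_done = kernel_done = 0.0
    for _ in range(n_streams):
        copy_done += tm                                  # copy engine is serial
        kernel_done = max(copy_done, kernel_done) + tk   # kernel waits for its data
    return kernel_done

# One stream: no overlap, so transfer time + kernel time.
# Four streams: most of the transfer is hidden behind kernel execution.
print(total_time(1, 10.0, 10.0))   # 20.0
print(total_time(4, 10.0, 10.0))   # 12.5
```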
3) Results:
Allocation of pinned memory on the host (CPU) and asynchronous multi-stream transfer, with overlap between memory transfer and kernel execution, greatly reduce the total time and attain a high bandwidth between the CPU and the GPU. The optimization technique discussed in section 2 is applied to the three algorithms under study: matrix addition, matrix multiplication and 2-D separable convolution.
All the results and simulations were obtained on the NVIDIA Tesla C1060. The NVIDIA Tesla C1060 computing processor board is a PCI Express 2.0 form factor add-in card based on the NVIDIA Tesla T10 graphics processing unit (GPU). Its properties are as follows:
CUDA Driver : 3.0
Compute capability (major) : 1
Compute capability (minor) : 3
Global Memory : 4 GB
Multiprocessors : 30
Cores (8 per multiprocessor) : 240
Constant Memory : 64 KB
Pinned Memory : YES
Concurrent copy and execute : YES
Max Block Size : 512*512*64
Max Grid Size : 65535*65535*1
Processor Clock : 1296 MHz
Memory Clock : 800 MHz
Memory Bandwidth : 102 GB/s
As shown above, the device supports concurrent copy and execute as well as pinned memory, so asynchronous transfer is supported by the NVIDIA Tesla C1060.
Matrix addition: Analysis of the contribution of data transfer time (%) for matrix addition on the NVIDIA Tesla C1060, without pinned memory allocation on the host, gives the following results:
Figure 3.1 Percentage of data transfer time (Y axis) with variation in matrix size (X axis) for matrix addition.
From Fig 3.1 it can be concluded that more than 93% of the total time is data transfer time. To speed up the memory transfer between the host and the device, pinned memory is used on the host; the resulting reduction in memory transfer time (%) is shown in Figure 3.2. After that, asynchronous transfer with multiple streams (2 and 4) was also tried on matrix addition, in which the memory transfer overlaps the kernel execution. But the amount of overlap is very small, since the kernel execution time is very small compared to the memory transfer time.
Figure 3.2 Comparison of memory transfer time (%, Y axis) for matrix addition with variation in matrix size (X axis), with non-pageable (pinned) memory and pageable (non-pinned) memory.
Matrix multiplication: Matrix multiplication is relatively computationally expensive, since each element of the resultant matrix requires the dot product of a row of one matrix with a column of the other. Matrix multiplication on the Tesla C1060 with and without pinned memory allocation gives the results shown in Fig 3.3.
Figure 3.3 Comparison of data transfer time (%) with pageable (non-pinned) and non-pageable (pinned) memory, with variation in matrix size (X axis), for matrix multiplication.
Data for Figure 3.2 (matrix addition, memory transfer time %):
Matrix size:          512    1024   2048   4096   8192
Pageable memory:      93.3   94.0   95.2   97.2   98.1
Non-pageable memory:  82.4   83.2   85.3   87.0   89.9
Data for Figure 3.3 (matrix multiplication, data transfer time %):
Matrix size:          128    256    512    1024   2048   4096   8192
Non-pageable memory:  43.3   52.52  60.7   76.72  81.77  82.62  84.85
Pageable memory:      57.37  65.75  72.85  81.45  85.06  88.03  91.2
With the allocation of pinned memory on the host, there is a good reduction in memory transfer time (%).
In multi-stream asynchronous transfer, the data is divided into equal parts, each of which is dedicated to a stream with a particular stream ID. For proper overlap, the computation associated with one stream must be independent of the data associated with the other streams. However, this is not the case for matrix multiplication: to calculate the first (or any) element of the product matrix we need a complete column of one matrix and a complete row of the other. Since the data is stored in global memory as a 1-D array, row after row, computing the elements of the last column requires the complete matrix to have been copied. So concurrent kernel execution and data transfer overlap is not possible for matrix multiplication.
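The independence requirement can be illustrated on the host side (a Python sketch, not the report's CUDA code): matrix addition splits into row chunks that each need only their own slice of the inputs, whereas one row chunk of a matrix product still needs the entire second matrix.

```python
def add_chunk(a_rows, b_rows):
    """A chunk of C = A + B needs only the matching rows of A and B."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a_rows, b_rows)]

def mul_chunk(a_rows, b_full):
    """A chunk of C = A * B needs its rows of A but ALL of B:
    every output element uses a complete column of B."""
    n = len(b_full[0])
    return [[sum(ra[k] * b_full[k][j] for k in range(len(ra)))
             for j in range(n)] for ra in a_rows]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# Addition: the two row chunks are processed independently (one per stream).
c_add = add_chunk(a[:1], b[:1]) + add_chunk(a[1:], b[1:])

# Multiplication: each chunk still requires the entire matrix b.
c_mul = mul_chunk(a[:1], b) + mul_chunk(a[1:], b)
```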
2-D Separable Convolution:
Figure 3.4 Memory transfer time (%) for 2-D separable convolution (without pinned memory) with variation in kernel size (image size: 1024*1024).
With the increase in kernel size, the data transfer time remains almost the same (the kernel size k being negligible compared to the image size, k*k << N*N), while the number of dot operations to calculate each output pixel increases, increasing the computation time. Therefore the data transfer time (%) decreases.
(Kernel sizes tested in Figure 3.4: 3, 7, 11, 15, 21, 31, 51.)
Figure 3.5 Comparison of data transfer time (%) with pageable and non-pageable (pinned) memory, with variation in image size (kernel size: 21).
Figure 3.6 Comparison of data transfer time (%) with pageable and non-pageable (pinned) memory, with variation in kernel size (image size: 1024*1024).
From Figure 3.5 it can be concluded that as the image size increases, the data transfer time increases, and so does the total computation time (more operations on a larger image), so the data transfer time (%) remains almost the same or decreases. Pinned memory speeds up the data transfer and thus reduces the data transfer time. With pinned memory there is also a reduction in data transfer time (%) for the different kernel sizes, as shown in Figure 3.6.
Data for Figure 3.5 (2-D separable convolution, memory transfer time %):
Image size:           512    1024   2048   3072   4096   8192
Non-pageable memory:  54.7   57.9   59.2   59.6   59.4   59.2
Pageable memory:      70.5   70.6   70.5   63.0   63.0   62.3
Data for Figure 3.6 (2-D separable convolution, memory transfer time %):
Kernel size:          3      7      11     15     21     31
Non-pageable memory:  80.7   74.6   67.2   63.2   56.9   50.2
Pageable memory:      82.2   77.8   74.0   71.1   66.69  59.9
Figure 3.7 Acceleration factor (with respect to a C code implementation on an Intel Core 2 Duo processor) for 2-D separable convolution with synchronous transfer (without pinned memory) and asynchronous transfer with multiple streams (1, 2, 4), with variation in kernel size and constant image size (1024*1024).
In Figure 3.7, the results with asynchronous transfer show a better acceleration factor with respect to the corresponding C code implementation on the CPU, because of the use of pinned memory and the overlap of data transfer with kernel execution. The results with 2 streams are better than with 4 streams because every data transfer has an associated overhead, so batching the data into many small parts proves more expensive than transferring fewer, larger parts. The larger the number of streams in the asynchronous transfer, the larger the overhead. So there is a tradeoff between the number of streams and the size of the input data.
(Kernel sizes tested in Figure 3.7: 3, 5, 11, 21, 31; series: synchronous, 1 stream, 2 streams, 4 streams.)
Data for Figure 3.8 (acceleration factor for 2-D separable convolution, kernel size 21):
Image size:    512    1024   2048   3072   4096
Synchronous:   12.3   13.0   17.3   22.4   22.5
1 stream:      17.1   23.2   30.0   30.6   32.0
2 streams:     19.4   32.5   45.0   41.1   48.0
4 streams:     13.1   30.5   44.5   48.0   49.0
Figure 3.8 (above) Acceleration factor (with respect to a C code implementation on an Intel Core 2 Duo CPU) for synchronous mode (without pinned memory) and asynchronous mode (1, 2, 4 streams) with variation in image size (X axis). The kernel size is kept constant at 21.
The asynchronous mode with one stream is equivalent to the synchronous mode with pinned memory on the host, so from Figure 3.8 the separate contributions of pinned memory and of multi-stream overlap can be estimated. The results with two streams are better than with four streams for image sizes below 2048. However, for image sizes above 2048, four streams are better, because of the tradeoff between the per-transfer overhead, the number of streams and the data size handled by each stream.
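This tradeoff can be made concrete with a simple pipeline model (one copy engine, one compute engine, transfers of different streams serialized with each other) that charges a fixed setup cost per transfer. The numbers below are arbitrary illustrative parameters, not measurements from this report:

```python
def total_time(n_streams, t_transfer, t_kernel, overhead):
    """Pipelined time with n streams, with a fixed setup cost per transfer."""
    tm = t_transfer / n_streams + overhead   # per-chunk transfer + overhead
    tk = t_kernel / n_streams                # per-chunk kernel time
    copy_done = kernel_done = 0.0
    for _ in range(n_streams):
        copy_done += tm                                  # copy engine is serial
        kernel_done = max(copy_done, kernel_done) + tk   # kernel waits for its data
    return kernel_done

# Small input (transfer 4, kernel 4, overhead 1): overhead dominates,
# so 2 streams beat 4 streams.
# Large input (transfer 40, kernel 40, overhead 1): overhead is negligible,
# so 4 streams beat 2 streams.
```

This reproduces the qualitative crossover seen in Figure 3.8: below a certain data size, fewer streams win; above it, more streams win.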
4) Conclusion:
With the continuous gain in the computing power of graphics processors and the need to speed up various real-time applications on GPUs, memory transfer optimization has become an important research area. Much work remains to be done to optimize memory operations and to increase the bandwidth between the CPU and the GPU, so as to bring the transfer time into line with the computation time on GPUs.
The overall performance is enhanced and the data transfer time between the CPU and the GPU reduced by allocating pinned memory on the host and overlapping asynchronous data transfer with kernel execution. As shown in the results, the use of pinned memory on the host attains a high bandwidth between the CPU and the GPU. In the case of 2-D separable convolution, the acceleration is brought about by the overlap of kernel execution with data transfer; this was the fastest way measured here to perform 2-D separable convolution on the graphics processor. The technique can also find application in iterative algorithms and image restoration methods.
5) Future Work:
With the reduction in data transfer time, there is also the possibility of further enhancing performance with multiple GPUs, distributing the load by dividing the input image in a specific pattern. In 2-D convolution, the apron (zero padding) takes a considerable amount of time in the calculation of the border pixels. This time can be reduced if the apron is neglected and the pixels that would be computed using the apron are left unchanged; here there is a tradeoff between quality and time. Another important consideration is coalesced access to global memory, which proves expensive if not done properly.
6) My Internship Experience:
My summer research internship at L2S, SUPELEC was one of the best experiences of my life. Through this internship I not only gained theoretical knowledge and practical experience of working on GPUs, but also experienced new things that were beyond my expectations. I was greatly impressed by French culture and the work ethic of the people. I very much like working in this research field of speeding up image processing algorithms on the graphics processor. The most valuable part of my three-month stay was the guidance I received; it gave me the chance to work independently, to think out of the box and, finally, to choose a research field for my future. Whenever I faced any problem, my lab colleagues were always there to help, whether it was a visa issue, a language problem or something to do with my internship subject. It was a good experience to interact with researchers from different backgrounds and countries and to learn about their motivation and dedication, which inspired me to work better. I found the people very amiable and helpful; they made me feel comfortable in a completely unknown environment. It was a pleasure to be part of the esteemed research team at L2S, SUPELEC, one of the most prestigious research institutes in France. My desire now is to join the L2S research team after my graduation, to contribute to the betterment of society and to do industrial research to the best of my potential. I am thankful to Nicolas Gac and Ali Mohammad-Djafari for giving me this lifetime opportunity, which gave me an international platform to work on and motivated me to continue in this research field. I look forward to a great future at L2S, carrying on my research.