Internship Technical Report 2010
L2S, SUPELEC-CNRS-University Paris Sud XI
Technical Report:
Speed Up of 2-D Separable Convolution Algorithm on
NVIDIA Graphics Processor by Memory Transfer Optimization
Under the guidance of:
Nicolas Gac
Laboratory of Signal and Systems (L2S)
(UMR 08506, CNRS-SUPELEC-Univ Paris Sud)
By:
Aditya Gautam (Roll No. 200730003)
B.Tech (ECE), IIIT Hyderabad, INDIA
CONTENTS

1) Introduction
   1.1) Purpose
   1.2) Scope
   1.3) Computing Power and Memory Bandwidth
   1.4) Algorithm Complexity and Arithmetic Intensity
2) Optimization Technique and Asynchronous Transfer
   2.1) Pinned Memory
   2.2) Asynchronous Transfer
   2.3) Streams
3) Results
4) Conclusion
5) Future Work
6) Internship Experience
Internship Technical Report 2010
3 L2S,SUPELEC-CNRS-University Paris Sud XI.
1) Introduction:
1.1) Purpose:
My research internship focused on speeding up computationally expensive algorithms on the graphics processor through memory transfer optimization. The aim of the project was to study the contribution of memory transfer time on the GPU for algorithms of different complexity, and to find a way to reduce the memory transfer time between the CPU and the GPU. The algorithms studied are matrix addition, matrix multiplication and 2-D separable convolution. The major hurdle in implementing an algorithm on the graphics processor is slow memory transfer, because the bandwidth between the CPU and the GPU is low compared to the bandwidth between device memory and the GPU. The optimization technique used is multi-stream transfer of data in asynchronous mode, which overlaps data transfer with kernel execution. The effect of pinned memory on the bandwidth between the CPU and the GPU is also studied and presented. Accelerating the 2-D separable convolution method by optimizing its memory transfers is another aim of this internship.
1.2) Scope:
The proposed optimization technique can be used in the implementation of the computationally heavy iterative reconstruction algorithms for image restoration developed at Laboratory L2S, SUPELEC. The 3D/2D reconstruction algorithm used in CT tomography developed at L2S requires the transfer of 3-D volume data, which takes a lot of time and is thus a major bottleneck in the effective implementation of this algorithm on GPUs. This algorithm can therefore be sped up by the proposed technique, which reduces the memory transfer time between the CPU and the GPU.
1.3) Computing Power and Memory Bandwidth:
With the advancement of technology, the computing power of graphics processors keeps increasing, as shown in Fig 1.1.
Figure 1.1 Peak GigaFlop performance of the GPU and CPU (from NVIDIA's Programming Guide 3.0).
Figure 1.2 Floating-point operations per second and memory bandwidth for GPU and CPU (from NVIDIA's Programming Guide 3.0).
As shown in Figure 1.2, the floating-point operations per second achieved by the GPU are much higher than those of the CPU. The reason behind this huge difference is that the GPU is specialized for compute-intensive, highly parallel computation and is designed so that more transistors are devoted to data computation and arithmetic operations rather than to data caching, flow control or memory operations.
The NVIDIA GPU used here is the NVIDIA Tesla C1060, which has 4 GB of global memory and a bandwidth of 102 GB/s between device memory and the GPU. The size of shared memory is 16 KB per multiprocessor (the NVIDIA Tesla C1060 has 30 multiprocessors), shared by all the threads belonging to that multiprocessor. The latency of global memory is very high compared to shared memory, so execution is optimized by transferring data from global memory into shared memory and performing the operations there.
Because of the low bandwidth between the CPU and the GPU, the data transfer time is high and thus becomes a limiting factor in the overall implementation of an algorithm on GPUs.
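The scale of this imbalance can be estimated with a back-of-the-envelope calculation. In the sketch below, the device-memory bandwidth (102 GB/s) is the Tesla C1060 figure quoted above, while the host-to-device rate (5 GB/s, a typical pinned-memory rate over PCI Express 2.0 x16) is an assumed illustrative value, not a measurement from this report:

```python
def transfer_time_ms(n_bytes, bandwidth_gb_per_s):
    """Time to move n_bytes at a given bandwidth, in milliseconds."""
    return n_bytes / (bandwidth_gb_per_s * 1e9) * 1e3

# One 1024*1024 image of single-precision floats.
image_bytes = 1024 * 1024 * 4

host_to_device = transfer_time_ms(image_bytes, 5.0)    # assumed PCIe 2.0 rate
within_device = transfer_time_ms(image_bytes, 102.0)   # C1060 device bandwidth

# The same image takes roughly 20x longer to cross the PCIe bus than to
# move within device memory, so host<->device traffic dominates.
```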
1.4) Algorithm Complexity and Arithmetic Intensity:
Any algorithm can be categorized as computation-bound or memory-bound depending on its memory transfer and kernel execution times. If the data transfer time is high compared to the computation time, the algorithm is memory-bound; otherwise it is computation-bound.
Figure 1.3 Plot of data transfer time (%) against computation time (%) for different algorithms (image/matrix size: 1024*1024).
Three algorithms of different complexity are considered: matrix addition, matrix multiplication and 2-D separable convolution. For matrix addition, the kernel execution (computation) time (7%) is very small compared to the memory transfer time, so it is memory-bound and is located towards the X axis (data transfer). 2-D separable convolution with two kernel sizes (21 and 51) is considered and plotted. With kernel size 51, the computation time (50%) equals the memory transfer time (50%), so the algorithm is neither computation-bound nor memory-bound and lies on the line x = y through the origin. Matrix multiplication lies between addition and 2-D separable convolution and also falls in the memory-bound category. Algorithms located above the line x = y are computation-bound, and those below it are memory-bound.
The memory transfer optimization technique can be effectively applied to algorithms that are memory-bound, i.e. located below the line x = y. The extent to which data transfer can be overlapped with execution in asynchronous mode, or equivalently the possible reduction in data transfer time, can also be estimated from this plot: the farther an algorithm lies below the line x = y, the smaller the fraction of the transfer that can be hidden (assuming the algorithm can be implemented with multi-stream asynchronous transfer, discussed later).
Arithmetic Intensity is defined as the ratio of the number of arithmetic operation to the total
number of the memory transfer operation (Between CPU and GPU).
Arithmetic Intensity = (number of arithmetic operations) / (number of memory transfer operations)
GPUs are well suited to problems that can be expressed as data-parallel computations, i.e. the same program executed in parallel on many data elements. Effective utilization of the GPU is favoured by high arithmetic intensity. Therefore the arithmetic intensity of an algorithm is an important factor in determining the extent to which it can be effectively implemented on the GPU.
Matrix addition: A + B = C
Consider the addition of matrices A and B (both of size N*N) whose result is stored in matrix C. The memory transfer operations comprise the transfer of matrices A and B from host to device (2*N*N elements) and the transfer of matrix C from device to host (N*N elements), so the total number of memory transfer operations is 3*(N*N). Each element of the resultant matrix C is the sum of the corresponding elements of A and B, so the number of arithmetic operations is N*N.
So, Arithmetic Intensity = (N*N) / (3*N*N) = 1/3.
Matrix multiplication: A*B = C
For matrices A and B (size N*N), each element of the resultant matrix C is generated by taking the dot product of a row of A with the corresponding column of B. The data transfer operations comprise the transfer of A and B from host to device (2*N*N elements) and of the resultant matrix C from device to host (N*N elements), so the total number of memory transfer operations is 3*(N*N). However, to calculate one element, N multiply operations have to be performed, so the total number of operations for matrix C is N*(N*N).
Arithmetic Intensity = N*(N*N) / (3*N*N) = N/3.
Here the arithmetic intensity is directly proportional to the size of the matrix: the larger the size, the higher the arithmetic intensity.
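As a sanity check, the two ratios above can be reproduced by counting operations explicitly. The following Python sketch is illustrative only (the report's actual kernels are CUDA C); the function names are my own:

```python
from fractions import Fraction

def arithmetic_intensity(arith_ops, transfer_ops):
    """Ratio of arithmetic operations to CPU<->GPU transfer operations."""
    return Fraction(arith_ops, transfer_ops)

def matrix_add_intensity(n):
    transfers = 3 * n * n   # A, B host->device; C device->host
    arith = n * n           # one addition per element of C
    return arithmetic_intensity(arith, transfers)

def matrix_mul_intensity(n):
    transfers = 3 * n * n   # same memory traffic as addition
    arith = n * n * n       # N multiplications per element of C
    return arithmetic_intensity(arith, transfers)
```

Note that the addition ratio is independent of N, while the multiplication ratio grows linearly with N, as derived above.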
2-D Convolution:
Separable convolution: In 2-D separable convolution the kernel is decomposed into a horizontal and a vertical kernel. First the image is convolved with the horizontal kernel, and then the resultant image is convolved with the vertical kernel. For an image of size N*N and kernel size k, the memory transfer operations comprise the image plus the kernel from host (CPU) to device (GPU), and the resultant image from device to host. Neglecting the kernel size in comparison to the image size (N >> k), the number of memory transfer operations is 2*(N*N). The number of operations per pixel is 2*k (k for the horizontal pass and k for the vertical pass), so for the complete image it is 2*k*(N*N).
Arithmetic Intensity = 2*k*(N*N) / (2*N*N) = k, which is directly proportional to the length of the kernel.
Non-separable convolution: Here the image is convolved with a kernel of size k*k, so calculating one pixel requires k*k operations (multiplications). The data transfer operations remain the same as for separable convolution.
Arithmetic Intensity = k*k*(N*N) / (2*N*N) = k*k/2.
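The decomposition behind these two ratios can be checked numerically: filtering with a separable k*k kernel (the outer product of a vertical and a horizontal vector) in two 1-D passes of k multiplies each gives exactly the same output as the direct 2-D pass with k*k multiplies per pixel. A pure-Python sketch (the report's implementation is CUDA C; zero padding at the borders, kernel applied in correlation form):

```python
def outer(v, h):
    """Separable 2-D kernel as the outer product of two 1-D kernels."""
    return [[vi * hj for hj in h] for vi in v]

def conv2d(img, ker):
    """Direct 2-D pass with zero padding: k*k multiplies per pixel."""
    n, k = len(img), len(ker)
    r = k // 2
    out = [[0.0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy, xx = y + i - r, x + j - r
                    if 0 <= yy < n and 0 <= xx < n:
                        acc += img[yy][xx] * ker[i][j]
            out[y][x] = acc
    return out

def conv_separable(img, v, h):
    """Horizontal pass with h, then vertical pass with v: 2*k multiplies per pixel."""
    n, k = len(img), len(h)
    r = k // 2
    rows = [[sum(img[y][x + j - r] * h[j]
                 for j in range(k) if 0 <= x + j - r < n)
             for x in range(n)] for y in range(n)]
    return [[sum(rows[y + i - r][x] * v[i]
                 for i in range(k) if 0 <= y + i - r < n)
             for x in range(n)] for y in range(n)]
```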
2) Optimization Technique and Asynchronous Transfer:
The peak bandwidth between device (global) memory and the GPU is much higher (102 GB/s for the NVIDIA Tesla C1060) than the peak bandwidth between host memory and the device. This imbalance is a bottleneck in the implementation of algorithms on GPUs and thus needs to be optimized. The aim of memory transfer optimization is to reduce the data transfer time, which is done by overlapping kernel execution with data transfer between the host and the device.
2.1) Page-Locked Host Memory (Pinned Memory)
Page-locked (pinned) memory is a scarce resource that attains the highest bandwidth between the host and the device. On devices that support pinned memory allocation and asynchronous transfer, it also allows copies of data between the host and the device to run concurrently with kernel execution on the GPU.
2.2) Asynchronous transfer:
In general, data transfers between the host and the device are blocking transfers, in which control is returned to the host thread only after the data has been completely transferred to the device. An asynchronous transfer is a non-blocking transfer in which control is returned to the host thread immediately. Asynchronous transfer requires pinned memory allocation on the host and a stream ID.
2.3) Streams:
A stream is a sequence of operations that are performed in order on the device. One stream may execute its commands out of order with respect to other streams. This property is used to overlap the data transfer of one stream with the kernel execution of another. Every stream has a stream ID; by default the stream ID is zero.
Stream creation:
cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);
Asynchronous transfer allows a data transfer operation to be overlapped with kernel execution:
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream1);
kernel<<<grid, block, 0, stream2>>>(otherData_d);
Here the asynchronous memory transfer takes place in stream1 and the kernel execution in stream2, so the two overlap. However, if stream2 is replaced by stream1, there is no overlap and the kernel execution starts only after the data transfer from host to device has completed.
Figure 2.1 Timeline for copying data between Host and Device and kernel execution with one stream
(From NVIDIA’s CUDA Best Practices Guide 2.3)
Figure 2.2 Timeline for copying data and kernel execution with 4 streams for the same data as in Figure 2.1 (from NVIDIA's CUDA Best Practices Guide 2.3)
The total time taken for the same operations as in Fig 2.1 with four streams is less than with a single stream. This is because, in the multi-stream case (shown in Figure 2.2), while the kernel execution of one stream is taking place, the data transfer of another stream proceeds at the same time. This overlap reduces the total time: the bandwidth between the CPU and the GPU is utilized while the GPU is executing kernels. It can also be seen that there is no overlap between the data transfer and the kernel execution of any single stream.
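The timing behaviour in Figures 2.1 and 2.2 can be sketched with a simple pipeline model. It assumes one copy engine and one compute engine (as on the Tesla C1060): transfers are serialized with each other and kernels with each other, but a transfer in one stream may overlap a kernel in another. This is an illustrative model, not a measurement:

```python
def total_time(n_streams, t_transfer, t_kernel):
    """Model the total time when the work is split equally across n streams."""
    tm = t_transfer / n_streams   # per-chunk host->device transfer time
    tk = t_kernel / n_streams     # per-chunk kernel execution time
    copy_done = kernel_done = 0.0
    for _ in range(n_streams):
        copy_done += tm                                  # copy engine is serial
        kernel_done = max(copy_done, kernel_done) + tk   # kernel waits for its data
    return kernel_done

# One stream: no overlap, so transfer time + kernel time.
# Four streams: most of the transfer is hidden behind kernel execution.
print(total_time(1, 10.0, 10.0))   # 20.0
print(total_time(4, 10.0, 10.0))   # 12.5
```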
3) Results:
Allocation of pinned memory on the host (CPU) and asynchronous multi-stream transfer, with overlap between memory transfer and kernel execution, greatly reduce the total time and attain a high bandwidth between the CPU and the GPU. The optimization technique discussed in section 2 is applied to the three algorithms under study: matrix addition, matrix multiplication and 2-D separable convolution.
All the results and simulations were obtained on the NVIDIA Tesla C1060. The NVIDIA Tesla C1060 computing processor board is a PCI Express 2.0 form factor add-in card based on the NVIDIA Tesla T10 graphics processing unit (GPU). Its properties are as follows:
CUDA Driver : 3.0
Compute capability (major) : 1
Compute capability (minor) : 3
Global Memory : 4 GB
Multiprocessors : 30
Cores (8 per multiprocessor) : 240
Constant Memory : 64 KB
Pinned Memory : YES
Concurrent copy and execute : YES
Max Block Size : 512*512*64
Max Grid Size : 65535*65535*1
Processor Clock : 1296 MHz
Memory Clock : 800 MHz
Memory Bandwidth : 102 GB/s
As shown above, the device supports concurrent copy and execute as well as pinned memory, so asynchronous transfer is supported by the NVIDIA Tesla C1060.
Matrix addition: Analysis of the contribution of data transfer time (%) for matrix addition on the NVIDIA Tesla C1060, without pinned memory allocation on the host, gives the following results:
Figure 3.1 Percentage of data transfer time (Y axis) with variation in matrix size (X axis) for matrix addition.
From Fig 3.1 it can be concluded that more than 93% of the total time is data transfer time. To speed up the memory transfer between the host and the device, pinned memory is used on the host; the resulting reduction in memory transfer time (%) is shown in Figure 3.2. After that, asynchronous transfer with multiple streams (2 and 4) was also tried on matrix addition, in which the memory transfer overlaps the kernel execution. But the amount of overlap is very small, since the kernel execution time is very small compared to the memory transfer time.
Figure 3.2 Comparison of memory transfer time (%, Y axis) for matrix addition with variation in matrix size (X axis), with non-pageable (pinned) memory and pageable (non-pinned) memory.
Matrix multiplication: Matrix multiplication is relatively computationally expensive, since each element of the resultant matrix requires the dot product of a row of one matrix with a column of the other. Matrix multiplication on the Tesla C1060 with and without pinned memory allocation gives the results shown in Fig 3.3.
Figure 3.3 Comparison of data transfer time (%) with pageable (non-pinned) and non-pageable (pinned) memory, with variation in matrix size (X axis), for matrix multiplication.
Data for Figure 3.2 (matrix addition, memory transfer time %):
Matrix size:          512    1024   2048   4096   8192
Pageable memory:      93.3   94.0   95.2   97.2   98.1
Non-pageable memory:  82.4   83.2   85.3   87.0   89.9
Data for Figure 3.3 (matrix multiplication, data transfer time %):
Matrix size:          128    256    512    1024   2048   4096   8192
Non-pageable memory:  43.3   52.52  60.7   76.72  81.77  82.62  84.85
Pageable memory:      57.37  65.75  72.85  81.45  85.06  88.03  91.2
With the allocation of pinned memory on the host, there is a good reduction in memory transfer time (%).
In multi-stream asynchronous transfer, the data is divided into equal parts, each of which is dedicated to a stream with a particular stream ID. For proper overlap, the computation associated with one stream must be independent of the data associated with the other streams. However, this is not the case for matrix multiplication: to calculate the first (or any) element of the product matrix we need a complete column of one matrix and a complete row of the other. Since the data is stored in global memory as a 1-D array, row after row, computing the elements of the last column requires the complete matrix to have been copied. So concurrent kernel execution and data transfer overlap is not possible for matrix multiplication.
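The independence requirement can be illustrated on the host side (a Python sketch, not the report's CUDA code): matrix addition splits into row chunks that each need only their own slice of the inputs, whereas one row chunk of a matrix product still needs the entire second matrix.

```python
def add_chunk(a_rows, b_rows):
    """A chunk of C = A + B needs only the matching rows of A and B."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a_rows, b_rows)]

def mul_chunk(a_rows, b_full):
    """A chunk of C = A * B needs its rows of A but ALL of B:
    every output element uses a complete column of B."""
    n = len(b_full[0])
    return [[sum(ra[k] * b_full[k][j] for k in range(len(ra)))
             for j in range(n)] for ra in a_rows]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# Addition: the two row chunks are processed independently (one per stream).
c_add = add_chunk(a[:1], b[:1]) + add_chunk(a[1:], b[1:])

# Multiplication: each chunk still requires the entire matrix b.
c_mul = mul_chunk(a[:1], b) + mul_chunk(a[1:], b)
```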
2-D Separable Convolution:
Figure 3.4 Memory transfer time (%) for 2-D separable convolution (without pinned memory) with variation in kernel size (image size: 1024*1024).
With the increase in kernel size, the data transfer time remains almost the same (the kernel size k being negligible compared to the image size, k*k << N*N), while the number of dot operations to calculate each output pixel increases, increasing the computation time. Therefore the data transfer time (%) decreases.
(Kernel sizes tested in Figure 3.4: 3, 7, 11, 15, 21, 31, 51.)
Figure 3.5 Comparison of data transfer time (%) with pageable and non-pageable (pinned) memory, with variation in image size (kernel size: 21).
Figure 3.6 Comparison of data transfer time (%) with pageable and non-pageable (pinned) memory, with variation in kernel size (image size: 1024*1024).
From Figure 3.5 it can be concluded that as the image size increases, the data transfer time increases, and so does the total computation time (more operations on a larger image), so the data transfer time (%) remains almost the same or decreases. Pinned memory speeds up the data transfer and thus reduces the data transfer time. With pinned memory there is also a reduction in data transfer time (%) for the different kernel sizes, as shown in Figure 3.6.
Data for Figure 3.5 (2-D separable convolution, memory transfer time %):
Image size:           512    1024   2048   3072   4096   8192
Non-pageable memory:  54.7   57.9   59.2   59.6   59.4   59.2
Pageable memory:      70.5   70.6   70.5   63.0   63.0   62.3
Data for Figure 3.6 (2-D separable convolution, memory transfer time %):
Kernel size:          3      7      11     15     21     31
Non-pageable memory:  80.7   74.6   67.2   63.2   56.9   50.2
Pageable memory:      82.2   77.8   74.0   71.1   66.69  59.9
Figure 3.7 Acceleration factor (with respect to a C code implementation on an Intel Core 2 Duo processor) for 2-D separable convolution with synchronous transfer (without pinned memory) and asynchronous transfer with multiple streams (1, 2, 4), with variation in kernel size and constant image size (1024*1024).
In Figure 3.7, the results with asynchronous transfer show a better acceleration factor with respect to the corresponding C code implementation on the CPU, because of the use of pinned memory and the overlap of data transfer with kernel execution. The results with 2 streams are better than with 4 streams because every data transfer has an associated overhead, so batching the data into many small parts proves more expensive than transferring fewer, larger parts. The larger the number of streams in the asynchronous transfer, the larger the overhead. So there is a tradeoff between the number of streams and the size of the input data.
(Kernel sizes tested in Figure 3.7: 3, 5, 11, 21, 31; series: synchronous, 1 stream, 2 streams, 4 streams.)
Data for Figure 3.8 (acceleration factor for 2-D separable convolution, kernel size 21):
Image size:    512    1024   2048   3072   4096
Synchronous:   12.3   13.0   17.3   22.4   22.5
1 stream:      17.1   23.2   30.0   30.6   32.0
2 streams:     19.4   32.5   45.0   41.1   48.0
4 streams:     13.1   30.5   44.5   48.0   49.0
Figure 3.8 (above) Acceleration factor (with respect to a C code implementation on an Intel Core 2 Duo CPU) for synchronous mode (without pinned memory) and asynchronous mode (1, 2, 4 streams) with variation in image size (X axis). The kernel size is kept constant at 21.
The asynchronous mode with one stream is equivalent to the synchronous mode with pinned memory on the host, so from Figure 3.8 the separate contributions of pinned memory and of multi-stream overlap can be estimated. The results with two streams are better than with four streams for image sizes below 2048. However, for image sizes above 2048, four streams are better, because of the tradeoff between the per-transfer overhead, the number of streams and the data size handled by each stream.
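This tradeoff can be made concrete with a simple pipeline model (one copy engine, one compute engine, transfers of different streams serialized with each other) that charges a fixed setup cost per transfer. The numbers below are arbitrary illustrative parameters, not measurements from this report:

```python
def total_time(n_streams, t_transfer, t_kernel, overhead):
    """Pipelined time with n streams, with a fixed setup cost per transfer."""
    tm = t_transfer / n_streams + overhead   # per-chunk transfer + overhead
    tk = t_kernel / n_streams                # per-chunk kernel time
    copy_done = kernel_done = 0.0
    for _ in range(n_streams):
        copy_done += tm                                  # copy engine is serial
        kernel_done = max(copy_done, kernel_done) + tk   # kernel waits for its data
    return kernel_done

# Small input (transfer 4, kernel 4, overhead 1): overhead dominates,
# so 2 streams beat 4 streams.
# Large input (transfer 40, kernel 40, overhead 1): overhead is negligible,
# so 4 streams beat 2 streams.
```

This reproduces the qualitative crossover seen in Figure 3.8: below a certain data size, fewer streams win; above it, more streams win.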
4) Conclusion:
With the continuous gain in the computing power of graphics processors and the need to speed up various real-time applications on GPUs, memory transfer optimization has become an important research area. Much work remains to be done to optimize memory operations and to increase the bandwidth between the CPU and the GPU, so as to bring the transfer time into line with the computation time on GPUs.
The overall performance is enhanced and the data transfer time between the CPU and the GPU reduced by allocating pinned memory on the host and overlapping asynchronous data transfer with kernel execution. As shown in the results, the use of pinned memory on the host attains a high bandwidth between the CPU and the GPU. In the case of 2-D separable convolution, the acceleration is brought about by the overlap of kernel execution with data transfer; this was the fastest way measured here to perform 2-D separable convolution on the graphics processor. The technique can also find application in iterative algorithms and image restoration methods.
5) Future Work:
With the reduction in data transfer time, there is also the possibility of further enhancing performance with multiple GPUs, distributing the load by dividing the input image in a specific pattern. In 2-D convolution, the apron (zero padding) takes a considerable amount of time in the calculation of the border pixels. This time can be reduced if the apron is neglected and the pixels that would be computed using the apron are left unchanged; here there is a tradeoff between quality and time. Another important consideration is coalesced access to global memory, which proves expensive if not done properly.
6) My Internship Experience:
My summer research internship at L2S, SUPELEC was one of the best experiences of my life. Through this internship I not only gained theoretical knowledge and practical experience of working on GPUs, but also experienced new things that were beyond my expectations. I was greatly impressed by French culture and the work ethic of the people. I very much like working in this research field of speeding up image processing algorithms on the graphics processor. The most valuable part of my three-month stay was the guidance I received; it gave me the chance to work independently, to think out of the box and, finally, to choose a research field for my future. Whenever I faced any problem, my lab colleagues were always there to help, whether it was a visa issue, a language problem or something to do with my internship subject. It was a good experience to interact with researchers from different backgrounds and countries and to learn about their motivation and dedication, which inspired me to work better. I found the people very amiable and helpful; they made me feel comfortable in a completely unknown environment. It was a pleasure to be part of the esteemed research team at L2S, SUPELEC, one of the most prestigious research institutes in France. My desire now is to join the L2S research team after my graduation, to contribute to the betterment of society and to do industrial research to the best of my potential. I am thankful to Nicolas Gac and Ali Mohammad-Djafari for giving me this lifetime opportunity, which gave me an international platform to work on and motivated me to continue in this research field. I look forward to a great future at L2S, carrying on my research.