A
PROJECT REPORT
on
Numerical Methods
Implementation On CUDA
submitted in partial fulfillment of the requirements for the degree of
Bachelor of Technology
in
Department of Computer Engineering
(2007-11)
Supervisor: Dr. Vijay Laxmi
Ankur Sharma (2007UCP132)
Nihar Amin (2007UCP161)
Praveen Khokher (2007UCP157)
Shehjad Khan (2007UCP113)
MALAVIYA NATIONAL INSTITUTE Of TECHNOLOGY, JAIPUR
May 2011
Contents
Acknowledgements ix
Certificate xi
1 Overview Of CUDA Programming Model 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Thread Level Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Memory Level Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Implementation Of Matrix Multiplication Algorithm On CUDA 5
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Matrix multiplication proves to be advantageous in the implementation of the following: 6
2.3 Sequential matrix-multiplication: . . . . . . . . . . . . . . . . . . . 6
2.4 Parallel matrix-multiplications on CUDA:- . . . . . . . . . . . . . . 6
2.4.1 Implementation: . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5 Kernel Specifications: . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.6 Salient Features: . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.7 Limitations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.8 Observations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.9 Conclusions: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Implementation Of Prefix Sum Algorithm On CUDA 11
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Sequential Prefix-sum algorithm: . . . . . . . . . . . . . . . . . . . 12
3.3 Parallel Prefix-Sum On CUDA: . . . . . . . . . . . . . . . . . . . . 12
3.3.1 Implementation- . . . . . . . . . . . . . . . . . . . . . . . . 12
3.4 Kernel Specifications: . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.5 Salient Features: . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.6 Limitations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.7 Observations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.8 Conclusions: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4 Implementation Of Bitonic Sort Algorithm On CUDA 17
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2 Parallel Bitonic-Sort On CUDA: . . . . . . . . . . . . . . . . . . . . 18
4.3 Salient Features: . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.4 Limitations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.5 Observations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.6 Conclusions: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5 Implementation of Odd Even transposition Sort 23
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.2 The odd-even transposition sort is advantageous as it can . . . . . . 23
5.3 Sequential Odd-Even Transposition Sort: . . . . . . . . . . . . . . . 24
5.4 Parallel Odd Even Transposition Sort: . . . . . . . . . . . . . . . . 24
5.4.1 Implemention . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.5 Kernel Specification: . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.6 Salient Features:- . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.7 Limitations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.8 Observations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.9 Conclusions: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6 Implementation Of Parallel Quicksort By Regular Sampling Algorithm On CUDA 29
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.2 Sequential Quicksort: . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.3 Parallel Quicksort Using Regular Sampling: . . . . . . . . . . . . . 30
6.3.1 Implementation: . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.4 Kernel Specifications: . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.5 Salient features: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.6 Limitations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.7 Observations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.8 Conclusions: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
7 Implementation of matrix transpose algorithm on CUDA 35
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
7.2 Matrix transpose proves to be advantageous in the implementation of the following: 36
7.3 Sequential matrix transpose: . . . . . . . . . . . . . . . . . . . . . . 36
7.4 Parallel matrix transpose: . . . . . . . . . . . . . . . . . . . . . . . 36
7.4.1 Implementation: . . . . . . . . . . . . . . . . . . . . . . . . . 36
7.5 Kernel specifications: . . . . . . . . . . . . . . . . . . . . . . . . . . 37
7.6 Salient features: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
7.7 Limitations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
7.8 Observations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
7.9 Conclusions: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
8 Implementation of parallel sum algorithm on CUDA 41
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
8.2 Parallel sum proves to be advantageous in the implementation of the following: 41
8.3 Sequential Sum Algorithm: . . . . . . . . . . . . . . . . . . . . . . . 42
8.4 Parallel Sum on CUDA: . . . . . . . . . . . . . . . . . . . . . . . . . 42
8.4.1 Implementation: . . . . . . . . . . . . . . . . . . . . . . . . . 42
8.5 Kernel Specification:- . . . . . . . . . . . . . . . . . . . . . . . . . . 43
8.6 Salient Features:- . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
8.7 Limitations:- . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
8.8 Observations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.9 Conclusions: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
9 Calculation Of Variance and Standard Deviations on CUDA 47
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
9.2 Finding variance and standard deviation proves to be advantageous . 47
9.3 Sequentially Calculate Variance and SD: . . . . . . . . . . . . . . . 48
9.4 Parallel Calculation of Variance and SD: . . . . . . . . . . . . . . . . 48
9.4.1 Implementation: . . . . . . . . . . . . . . . . . . . . . . . . . 48
9.5 Kernel Specification: . . . . . . . . . . . . . . . . . . . . . . . . . . 49
9.6 Limitations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
9.7 Observations:- . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
9.8 Conclusions: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
10 Data of Algorithms 53
List of Figures
1.1 Thread Level Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Memory Level Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Thread Level Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 execution time vs Input size . . . . . . . . . . . . . . . . . . . . . . 8
2.3 SpeedUp vs input size . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 SpeedUp vs input size . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1 Prefix-sum algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2 Prefix-sum algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3 Prefix-sum algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.1 Sample Bitonic Sorting . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2 Kernel Used in Bitonic Sorting . . . . . . . . . . . . . . . . . . . . . 19
4.3 Execution time vs input size . . . . . . . . . . . . . . . . . . . . . . 20
4.4 slope of speedUp vs input size . . . . . . . . . . . . . . . . . . . . . 20
4.5 speedUp vs input size . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.1 Execution time vs input size . . . . . . . . . . . . . . . . . . . . . . 26
5.2 speedUp vs input size . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.1 Sequential Quicksort algorithm . . . . . . . . . . . . . . . . . . . . 30
6.2 execution time vs input size . . . . . . . . . . . . . . . . . . . . . . 33
6.3 speedUp vs input size . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.4 speedUp vs input size . . . . . . . . . . . . . . . . . . . . . . . . . . 34
7.1 Execution time vs input size . . . . . . . . . . . . . . . . . . . . . . 38
7.2 speedUp vs input size . . . . . . . . . . . . . . . . . . . . . . . . . . 38
7.3 speedUp vs input size . . . . . . . . . . . . . . . . . . . . . . . . . . 39
8.1 Execution time vs input size . . . . . . . . . . . . . . . . . . . . . . 44
8.2 speedUp vs input size . . . . . . . . . . . . . . . . . . . . . . . . . . 45
8.3 speedUp vs input size . . . . . . . . . . . . . . . . . . . . . . . . . . 45
9.1 Execution time vs input size . . . . . . . . . . . . . . . . . . . . . . 50
9.2 speedUp vs input size . . . . . . . . . . . . . . . . . . . . . . . . . . 50
9.3 speedUp vs input size . . . . . . . . . . . . . . . . . . . . . . . . . . 51
List of Tables
10.1 Matrix Multiplication(time in 10−6s) . . . . . . . . . . . . . . . . . 54
10.2 Bitonic Sort Algorithm (time in 10−6s) . . . . . . . . . . . . . . . . 54
10.3 Prefix Sum (time in 10−6s) . . . . . . . . . . . . . . . . . . . . . . . 54
10.4 Odd-Even Transposition Sort (time in 10−6s) . . . . . . . . . . . . . 55
10.5 Quicksort (time in 10−6s) . . . . . . . . . . . . . . . . . . . . . . . . 55
10.6 Matrix-transpose (time in 10−6s) . . . . . . . . . . . . . . . . . . . 55
10.7 Summation Algorithm (time in 10−6s) . . . . . . . . . . . . . . . . 56
10.8 Variance and SD (time in 10−6s) . . . . . . . . . . . . . . . . . . . . 56
Acknowledgements
We wish to express our gratitude to all people involved in the successful comple-
tion of our Final Year Major Project, especially to our project mentor Dr. Vijay
Laxmi for her guidance and critical reviews.
Our sincere thanks to Dr. M. S. Gaur, who was very generous in devoting his
precious time, sharing his knowledge with us, and helping us out in every possible
manner.
We are also thankful to all of our team members, working with whom was a
great experience.
And finally, our deep gratitude to our family members for their unflinching emo-
tional support during the whole period.
Ankur Sharma
Nihar Amin
Praveen Khokher
Shehjad Khan
May 2011
Certificate
This is to certify that the work contained in this report entitled "Numerical
Methods Implementation On CUDA" by Ankur Sharma (2007UCP132), Nihar
Amin (2007UCP161), Praveen Khokher (2007UCP157) and Shehjad Khan
(2007UCP113) has been carried out under my supervision and this work has not
been submitted elsewhere for a degree.
May, 2011
Dr. Vijay Laxmi
Department of Computer Engineering,
Malaviya National Institute of Technology,
Jaipur.
ABSTRACT
Parallel computing is the process of dividing large problems into smaller ones and
executing them concurrently. This implies that many computations are carried
out simultaneously. The main objective of devising parallel algorithms is to check
whether they give faster responses than their sequential versions. The basic aim of
the project is the implementation of numerical methods for heavy calculations on
the CUDA architecture and their comparison with the time taken for the same
calculations performed sequentially on the CPU. The CUDA architecture, and how
computations are mapped onto threads and blocks, is studied first. Algorithms that
can be implemented in parallel are identified, their sequential CPU codes are written,
and their parallel implementations on the CUDA architecture are then developed.
Sets of data are used to study the time taken by both implementations, and inferences
are made, primarily on the basis of the complexities of the sequential algorithms
and their method of implementation on CUDA. Some parallel algorithms give
sufficient speed-up and some are slower than the sequential versions. The reasons
and conclusions are inferred, and possible optimizations are mentioned.
Chapter 1
Overview Of CUDA
Programming Model
1.1 Introduction
Compute Unified Device Architecture (CUDA) is an application programming
interface to graphics processors. It is basically a parallel computing architecture
developed by Nvidia. The architecture emphasizes running many threads slowly in
parallel rather than running one particular thread very fast. CUDA-specific
computations are performed on the GPU (graphics processing unit). The
architecture favours applications which are compute intensive rather than memory
intensive. It is a scalable programming model. Programmers generally use C for
CUDA to execute code on the GPU.
There are three levels of abstraction in CUDA which are visible to the programmer:
1. Thread level hierarchy
2. Memory level hierarchy
3. Barrier synchronization
The basic advantage of using CUDA is to run the parallel fraction of a large code
efficiently and quickly. It basically follows the approach of dividing a large set of
input data into blocks and executing the different blocks in parallel. The main
features to look out for in the parallel processing of blocks are efficient
communication of data between different blocks and between the threads of the
same block, and synchronization between blocks and between the threads of a block.
CUDA executes the sequential part of the code on the CPU, while the parallel
portion is executed on the GPU. The GPU code is compiled by the Open64
compiler, which produces parallel thread execution (PTX) files to run on the GPU.
Qualifiers are used to distinguish between the variables and functions of the CPU
code and the GPU code. CUDA operates on a single instruction multiple data
(SIMD) architecture, but threads can diverge from this on the basis of conditional
operators, blockIdx and threadIdx.
1.2 Thread Level Hierarchy
The thread level abstraction can be viewed as shown in the figure below:
Figure 1.1: Thread Level Hierarchy
The thread level abstraction on CUDA can be viewed as a grid of blocks containing
threads. Each thread possesses a unique ID associated with it. A block can contain
up to a maximum of 512 threads on the Quadro FX 1700 GPGPU architecture. A
thread can have a unique index in the x, y and z dimensions, i.e. threadIdx.x,
threadIdx.y and threadIdx.z. Similarly, a collection of blocks is called a grid, and a
grid can contain blocks in all three dimensions. The threads within a block can
communicate with each other using the shared memory visible per block and can
synchronize their execution using the built-in __syncthreads() function. The
execution of different blocks launched by a kernel cannot be synchronized using the
__syncthreads() function; different blocks communicate with each other using the
device memory (global memory). When a kernel is launched, a grid of thread blocks
is created on the device, with each thread block containing many threads. Both
fine-grained and coarse-grained data parallelism can be implemented in CUDA: the
threads provide fine-grained parallelism while the blocks provide coarse-grained
parallelism.
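As a concrete illustration of this hierarchy, the sketch below launches a 2-D grid of 2-D blocks and computes a unique global index for each thread. It is an illustrative example only; the kernel name and sizes are assumptions, not code from this project.

#include <stdio.h>

// Each thread computes its (row, col) position from the block and thread
// indices and writes a unique per-element ID into global memory.
__global__ void fill_with_global_id(int *out, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width)
        out[row * width + col] = row * width + col;
}

int main(void)
{
    const int W = 64, H = 64;
    int *d_out;
    cudaMalloc((void**)&d_out, W * H * sizeof(int));

    dim3 block(16, 16);                      // 256 threads per block
    dim3 grid((W + 15) / 16, (H + 15) / 16); // enough blocks to cover the data
    fill_with_global_id<<<grid, block>>>(d_out, W, H);
    cudaThreadSynchronize();                 // wait for the kernel to finish

    cudaFree(d_out);
    return 0;
}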
1.3 Memory Level Hierarchy
The memory level abstraction can be viewed as shown in the figure below:
Figure 1.2: Memory Level Hierarchy
There are four different types of memory shown above: registers, shared, global and
constant (not including the texture memory). The global memory can be accessed
by every thread, by different blocks and by the CPU. The registers are specific to
each thread and are the fastest type of memory. The shared memory is visible to a
particular block, and thus the threads of a block can access it. Constant memory is
faster than global memory but slower than registers and shared memory; however,
it can only be written to in host code. Device code can read constant memory but
cannot write to it. The sizes of global and constant memory can scale to gigabytes,
but the size of shared memory is very limited (usually up to 16 KB per block).
The memory allocation and deallocation of the global memory is done by the host.
Functions like cudaMalloc() and cudaMemcpy() are used for the allocation and
movement of data to or from the device. Identifiers like cudaMemcpyDeviceToHost
are used to guide the direction of data transfer. The memory transfer functions can
be synchronous as well as asynchronous; synchronous means the CPU can resume
its execution only after the entire data has been transferred to the GPU.
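A minimal host-side sketch of this allocate / copy / compute / copy-back pattern is given below. The array size and the scale() kernel are assumed placeholders for illustration, not the project's code.

#include <stdio.h>

__global__ void scale(float *data, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n)
        data[id] *= 2.0f;                    // trivial device-side work
}

int main(void)
{
    const int N = 1024;
    size_t bytes = N * sizeof(float);
    float h_data[1024];
    for (int i = 0; i < N; i++) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc((void**)&d_data, bytes);                         // allocate global memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // host -> device

    scale<<<N / 512, 512>>>(d_data, N);

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(d_data);                                           // deallocate

    printf("h_data[10] = %f\n", h_data[10]);
    return 0;
}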
Chapter 2
Implementation Of Matrix
Multiplication Algorithm On
CUDA
2.1 Introduction
Matrix multiplication has inherent parallelism in it, and thus by using a parallel
architecture we can do the work in less time, i.e. achieve speed-up. We multiply
two matrices of size M x N and N x O and get a resulting matrix of dimension
M x O. It is a necessary condition that the number of columns of the first matrix is
equal to the number of rows of the second matrix, otherwise multiplication is not
possible.
Figure 2.1: Thread Level Hierarchy
INPUT: Two matrices, say A and B, with dimensions M x N and N x O.
OUTPUT: Final matrix with dimension M x O.
2.2 Matrix multiplication proves to be advantageous in the implementation of the following:
1. Graph Theory
2. Probability theory and statistics
3. Symmetries and transformations of physics
4. MATLAB
2.3 Sequential matrix-multiplication:
Suppose we have to multiply two matrices A and B and get the final result in matrix
C. Then each element of C can be found by

sum = sum + mat1[i][k]*mat2[k][j];
mat3[i][j] = sum;

Here r1 is the number of rows of the first matrix and c2 is the number of columns of
the second matrix:

for (i = 0; i < r1; i = i + 1)
{
    for (j = 0; j < c2; j = j + 1)
    {
        sum = 0;
        for (k = 0; k < c1; k++)
            sum = sum + mat1[i][k]*mat2[k][j];
        mat3[i][j] = sum;
    }
}
2.4 Parallel matrix-multiplications on CUDA:-
As matrix multiplication has many independent computations, we can expect to
get some speed-up using a parallel architecture like CUDA.
2.4.1 Implementation:
We launch the same number of threads as the number of elements in the resultant
matrix. Each thread simultaneously calculates its corresponding element of the
resultant matrix. Our blocks are of a 2D nature and have dimension N x O (here we
have taken input values such that N and O are equal). Both dimensions of the 2D
grid are equal to sqrt(total number of blocks launched). Indexing of each element is
done using threadIdx.x, threadIdx.y, blockIdx.x and blockIdx.y:

dim3 threads(block_X, block_Y);
float grid_D = sqrt(num_blocks);
dim3 grid(grid_D, grid_D);
// Indexing:
int row = blockIdx.y*block_D + threadIdx.y;
int col = blockIdx.x*block_D + threadIdx.x;
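A global-memory matrix multiplication kernel along these lines is sketched below. The kernel body, the block size of 16 and the names used are our reconstruction for illustration, not the project's exact matrixMul_globalmemory code.

#define block_D 16

// C (M x O) = A (M x N) * B (N x O); one thread computes one element of C,
// reading A and B directly from global memory.
__global__ void matmul_global(const float *A, const float *B, float *C,
                              int M, int N, int O)
{
    int row = blockIdx.y * block_D + threadIdx.y;
    int col = blockIdx.x * block_D + threadIdx.x;
    if (row < M && col < O)
    {
        float sum = 0.0f;
        for (int k = 0; k < N; k++)
            sum += A[row * N + k] * B[k * O + col];   // dot product of row and column
        C[row * O + col] = sum;
    }
}

// Example launch for square matrices of dimension dim (host code):
//   dim3 threads(block_D, block_D);
//   dim3 grid((dim + block_D - 1) / block_D, (dim + block_D - 1) / block_D);
//   matmul_global<<<grid, threads>>>(d_A, d_B, d_C, dim, dim, dim);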
2.5 Kernel Specifications:
A) __global__ void matrixMul_globalmemory() - 9 registers, 28+16 bytes of smem,
4 bytes of cmem[1].
2.6 Salient Features:
1. We have implemented it in global memory as our threads are independent of
each other and we face no synchronisation problem.
2. The motivation for using global memory was to run our code for matrices with
large dimensions.
3. The code is generalised to run on a very large number of values.
4. Both the times t1 (without considering memory copy overhead) and t2
(considering memory transfer overhead) are calculated.
2.7 Limitations:
1. For large arrays (>512 values), the input size was limited to multiples of 512.
2. GPU occupancy of 67% was achieved, as calculated by the GPU Occupancy
Calculator.
2.8 Observations:
1. Immediate speed-up for N > 32, due to the n^3 complexity of the sequential
algorithm.
2. Sequential time is almost linearly proportional to the size of the resultant matrix.
3. The initial speed-up grows quickly with N, i.e. the slope of the speed-up graph is
very steep.
4. With the increase in size of the input, the time taken by the sequential code
increases almost linearly, whereas the time taken by the kernel to execute remains
roughly constant; the overall performance of the parallel code, however, is degraded
by the time accounted for memory copy overhead between host and device.
2.9 Conclusions:
1. As the sequential algorithm is of order n^3, we got a decent speed-up for large
dimensions.
2. The parallel approach is very favourable when the sequential complexity is higher.
3. Even better speed-ups can be achieved with memory optimization techniques.
Figure 2.2: execution time vs Input size
Figure 2.3: SpeedUp vs input size
Figure 2.4: SpeedUp vs input size
Chapter 3
Implementation Of Prefix Sum
Algorithm On CUDA
3.1 Introduction
Prefix sum, also known as the partial sum of a series, is in programming terms the
fold of the addition operation. The prefix sum is considered to be one of the simplest
and most useful building blocks of parallel algorithms. The prefix sum can be
calculated for very large sets of input data and is generally described as below:
For a set of N values {a1, a2, a3, ..., aN}, the prefix sum is
{a1, (a1+a2), (a1+a2+a3), ..., (a1+...+aN)}.
For example, a[8] = {1,3,4,2,6,3,7,1}
prefix-sum = {1,4,8,10,16,19,26,27}
Prefix-sum proves to be advantageous in the implementation of the following:
1. In the implementation of radix sort and quicksort.
2. Performing lexical analysis and searching for regular expressions.
3. In evaluating polynomials, solving recurrences and adding multiprecision numbers.
4. It can be very helpful in performing string matching algorithms.
3.2 Sequential Prefix-sum algorithm:
The sequential prefix-sum algorithm is a very simple method to calculate the prefix
sum of a given input array of numbers: just loop through the array and add the
current value to the previously accumulated value. The logic is demonstrated below:

for (i = 1; i < size; i = i + 1)
    a[i] = a[i] + a[i-1];

This code performs exactly N-1 adds for an array of size N and thus is a very simple
implementation.
3.3 Parallel Prefix-Sum On CUDA:
The prefix-sum algorithm can be performed very efficiently using the parallel
architecture. We just need to divide the input array into blocks of proper dimension
and launch the kernel.
3.3.1 Implementation-
For an input array of size N (which can be very large), a single-dimension grid is
created with N/512 blocks. If the input size is N < 512, then a grid with one block
containing N threads is launched by the kernel function.
Each block is provided with a shared array of size 512 and its own set of shared
variables. All the values of the input array, which are stored in global memory, are
mapped to a specific thread ID that depends on the number of blocks:

ID = blockIdx.x*dim_block + threadIdx.x;

Thus, the respective elements are copied from the global memory to the shared
memory of each block. The partial sums of the values in each block are generated
and stored in a global array according to the respective block index.
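A block-level shared-memory scan along the lines described above can be sketched as follows. This is an illustrative reconstruction (a simple Hillis-Steele scan) and not the project's exact Sum_prefix kernel; it assumes one block of up to 512 threads per 512-element segment.

__global__ void block_prefix_sum(const int *in, int *out, int *block_sums, int n)
{
    __shared__ int temp[512];
    int tid = threadIdx.x;
    int id  = blockIdx.x * blockDim.x + tid;

    temp[tid] = (id < n) ? in[id] : 0;      // copy from global to shared memory
    __syncthreads();

    // Hillis-Steele inclusive scan within the block.
    for (int offset = 1; offset < blockDim.x; offset *= 2)
    {
        int val = (tid >= offset) ? temp[tid - offset] : 0;
        __syncthreads();                    // all reads finish before any write
        temp[tid] += val;
        __syncthreads();                    // all writes finish before the next round
    }

    if (id < n)
        out[id] = temp[tid];                // per-block inclusive prefix sum
    if (tid == blockDim.x - 1)
        block_sums[blockIdx.x] = temp[tid]; // block total, propagated to later blocks
}

The block totals written to block_sums would then be accumulated and added to the elements of the following blocks, matching the global-array propagation described above.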
3.4 Kernel Specifications:
1. __global__ void Sum_prefix() - 6 registers, 4120+16 bytes of smem, 4 bytes of cmem[1]
2. __global__ void sum() - 5 registers, 2076+16 bytes of smem, 8 bytes of cmem[1]
3.5 Salient Features:
1. The use of shared memory to perform consecutive reads, which reduces the time
that would otherwise have been spent performing the same reads and writes in
global memory.
2. Proper synchronization between threads operating in parallel inside a block.
3. It was difficult to perform synchronization between different blocks, so the sums
of previous blocks were propagated to the consecutive blocks using a global array.
4. The code is generalised to run on a very large number of values.
5. Both the times t1 (without considering memory copy overhead) and t2
(considering memory transfer overhead) are calculated.
3.6 Limitations:
1. For large arrays (>512 values), the input size was limited to multiples of 512.
2. GPU occupancy of 67% was achieved, as calculated by the GPU Occupancy
Calculator.
3.7 Observations:
1. For very small input sizes, the sequential prefix sum appears to be much faster
than the parallel code.
2. With the increase in size of the input, the time taken by the sequential code
increases almost linearly, whereas the time taken by the kernel to execute remains
roughly constant.
3. Very large speed-ups with respect to kernel execution times are achieved, which
demonstrates the efficiency of running the parallel code on CUDA, but the memory
overhead for large values limits the overall speed-up.
3.8 Conclusions:
1. Using efficient memory optimization techniques, the memory transfer overhead
between the host and the device can be reduced.
2. Using much better kernel optimization, the speed-up can be increased.
Figure 3.1: Prefix-sum algorithm
Figure 3.2: Prefix-sum algorithm
Figure 3.3: Prefix-sum algorithm
Chapter 4
Implementation Of Bitonic Sort
Algorithm On CUDA
4.1 Introduction
Bitonic sort is a fast method to sort a large number of values. It basically contains
two types of operations, shown by a down arrow (also the (+) operation, just a
symbolic representation) and an up arrow (also the (-) operation). In a (+)
operation, two values are compared, and after the comparison the larger value
should be at the higher index (for this purpose swapping might be required). In a (-)
operation, two values are compared and the larger value should be at the lower
index (again, swapping may or may not be required).
INPUT: Array of N elements, say A.
OUTPUT: Sorted array of A, say sort(A), such that a(i) <= a(j) for all
0 <= i < j <= N-1.
Bitonic sort proves to be advantageous in the implementation of the following:
1. In any application which requires sorted input, for example the binary search
algorithm.
2. In forming directories and managing large data.
Figure 4.1: Sample Bitonic Sorting
4.2 Parallel Bitonic-Sort On CUDA:
The parallel bitonic sort can be performed very efficiently using the parallel CUDA
architecture. For N elements, we can divide the problem into log2(N) stages, and
each stage can further be divided into sub-stages. Stage i contains i sub-stages; i.e.
if we have 8 elements, then the total number of stages is 3: the 1st stage has 1
sub-stage, the 2nd stage has 2 sub-stages and the 3rd stage has 3 sub-stages. Each
sub-stage has to do N/2 independent comparisons. Thus we can launch N/2 threads
for these N/2 computations. But sub-stages are not independent from each other,
and thus we have to ensure proper synchronization between threads, otherwise we
will get incorrect results.
As in our CUDA architecture a block can have at most 512 threads, for input sizes
larger than 512 we would have to launch multiple blocks. This requires inter-block
synchronization, which we tried but could not implement, so we have computed
results only up to 512 values.
We have to find whether a thread has to perform the (+) or the (-) operation. For
this purpose we have used a flag variable in our kernel:
flag = (int)(id/power(i)) % 2;
If flag has value 0 then we perform the (+) operation, otherwise the (-) operation.
Threads in the block are of 1D nature and can be accessed by indexing them using
threadIdx.x:
id = threadIdx.x;
For synchronisation of the threads of the same block we have used the standard
library function __syncthreads().
Figure 4.2: Kernel Used in Bitonic Sorting
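Since the project's kernel appears only as Figure 4.2, the following is an illustrative single-block bitonic sort kernel written in the same spirit (one thread per element, with a direction test playing the role of the flag variable); it assumes n is a power of two no larger than 512 and the launch <<<1, n>>>.

__global__ void bitonic_sort(int *data, int n)
{
    __shared__ int s[512];
    int tid = threadIdx.x;
    s[tid] = data[tid];
    __syncthreads();

    // k = size of the bitonic sequences being built, j = compare distance.
    for (int k = 2; k <= n; k <<= 1)
    {
        for (int j = k >> 1; j > 0; j >>= 1)
        {
            int partner = tid ^ j;                     // element to compare with
            if (partner > tid)
            {
                int ascending = ((tid & k) == 0);      // (+) or (-) operation
                if ((ascending && s[tid] > s[partner]) ||
                    (!ascending && s[tid] < s[partner]))
                {
                    int t = s[tid]; s[tid] = s[partner]; s[partner] = t;
                }
            }
            __syncthreads();                           // sub-stages must not overlap
        }
    }
    data[tid] = s[tid];
}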
4.3 Salient Features:
1. Different sub-stages of the same stage are not independent.
2. In the last stage we only have to perform (+) operations.
4.4 Limitations:
1. We have assumed that the number of input values must be a power of 2, such as
4, 8, 16, 32, 64, 128, 256, 512.
2. As we have only used one 1D block, we can take at most 512 values for sorting.
3. GPU occupancy of 67% was achieved, as calculated by the GPU Occupancy
Calculator.
4.5 Observations:
1. Speed-up is gained for N > 256.
2. The sequential time increases nearly linearly with increasing N.
3. Very sharp increase in speed-up after N = 256.
4.6 Conclusions:
1. Speed-up decreases significantly due to memory overhead.
2. Much higher speed-ups can be achieved with multiple blocks.
Figure 4.3: Execution time vs input size
Figure 4.4: slope of speedUp vs input size
Figure 4.5: speedUp vs input size
Chapter 5
Implementation of Odd Even
transposition Sort
5.1 Introduction
The odd-even transposition sorting network for n input values consists of n
comparison stages. In each stage, either all inputs at odd index positions or all
inputs at even index positions are compared with their next element. Odd stages
and even stages alternate: an even stage can start only after the completion of the
preceding odd stage, and vice versa. It is similar to bubble sort, except that
odd-even transposition sort compares disjoint pairs by using alternating odd and
even index values during the different phases of the sort.
5.2 The odd-even transposition sort is advantageous as it can
1. Be used for sorting on 2-D processor arrays.
2. Be implemented in parallel, achieving speed-ups of more than 2.0 even on fairly
small numbers of elements.
5.3 Sequential Odd-Even Transposition Sort:
The algorithm is simple to implement and is similar to bubble sort. In the first
phase of an odd-even exchange, control visits all the even indices and compares each
element with its neighbouring element. In the second phase control visits the odd
indices and compares their neighbouring elements. These pairs of phases continue
until the array is sorted. Thus, there are exactly half as many pairs of phases as
there are elements in the array to be sorted. The looping logic is as follows:

for (i = 0; i < n/2; i = i + 1)
{
    for (j = 0; j + 1 < n; j = j + 2)
        if (A[j] > A[j+1])
        {
            int T = A[j];
            A[j] = A[j+1];
            A[j+1] = T;
        }
    for (j = 1; j + 1 < n; j = j + 2)
        if (A[j] > A[j+1])
        {
            int T = A[j];
            A[j] = A[j+1];
            A[j+1] = T;
        }
}
5.4 Parallel Odd Even Transposition Sort:
The odd-even transposition sort on the CUDA architecture is implemented on a
single block with a maximum size of 512 elements. Each thread processes one
element, and hence even threads process even-indexed elements and odd threads
process odd-indexed elements.
5.4.1 Implemention
For an input size of N, a block with N threads is created and each thread processes
one element. The kernel creates a shared memory region for the block and copies
the array into it. All the values of the input array, which are stored in global
memory, are mapped to a specific thread ID that depends on the number of blocks:
ID = blockIdx.x*dim_block + threadIdx.x
Thus, the respective elements are copied from the global memory to the shared
memory of the block. The kernel then sorts the array in combinations of odd-even
phases and the result is copied back to the host memory. A sketch of such a kernel
is shown below.
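A single-block kernel matching this description might look like the sketch below; it is our reconstruction of the idea, not the report's actual Sort() kernel, and it assumes n <= 512 and the launch <<<1, n>>>.

__global__ void odd_even_sort(int *data, int n)
{
    __shared__ int s[512];
    int tid = threadIdx.x;
    if (tid < n) s[tid] = data[tid];        // copy the input into shared memory
    __syncthreads();

    for (int phase = 0; phase < n; phase++)
    {
        // Even phases: even-indexed threads compare with their right neighbour.
        // Odd phases: odd-indexed threads do the same.
        if ((tid % 2) == (phase % 2) && tid + 1 < n && s[tid] > s[tid + 1])
        {
            int t = s[tid]; s[tid] = s[tid + 1]; s[tid + 1] = t;
        }
        __syncthreads();                    // the even phase strictly follows the odd phase
    }

    if (tid < n) data[tid] = s[tid];        // write the sorted array back
}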
5.5 Kernel Specification:
__global__ void Sort() - 8 registers, 2068+16 bytes of smem, 4 bytes of cmem[1].
5.6 Salient Features:-
1. The use of shared memory to perform consecutive reads, which reduces the time
that would otherwise have been spent performing the same reads and writes in
global memory.
2. Proper synchronization between threads operating in parallel inside a block.
3. It was difficult to perform synchronization between different blocks, so the results
of previous blocks were propagated to the consecutive blocks using a global array.
4. Both the times t1 (without considering memory copy overhead) and t2
(considering memory transfer overhead) are calculated.
5. Synchronization is done to ensure that during parallel execution of threads the
even phase always follows the odd phase.
5.7 Limitations:
1. The maximum size of the array can be 512, limited by the maximum number of
threads in a block.
2. GPU occupancy of 67% was achieved, as calculated by the GPU Occupancy
Calculator.
5.8 Observations:
1. Steep increase in speed-up as N increases.
2. Because N is limited to 512, the memory overhead time is less than the
calculation time; therefore memory overhead has less effect on the performance graph.
5.9 Conclusions:
1. Due to the computational complexity of the sequential approach, the parallel
approach gains a recognizable speed-up.
2. Because N is limited to 512, the memory overhead time is less than the
calculation time; therefore memory overhead has less effect on the performance graph.
Figure 5.1: Execution time vs input size
Figure 5.2: speedUp vs input size
Chapter 6
Implementation Of Parallel
Quicksort By Regular Sampling
Algorithm On CUDA
6.1 Introduction
Quicksort (also known as partition-exchange sort) is a very well known sorting
algorithm developed by C. A. R. Hoare. It is a comparison sort and, in efficient
implementations, is not a stable sort. Quicksort tends to make excellent use of the
memory hierarchy, taking advantage of virtual memory and the available caches. It
is very well suited to modern computer architectures; as it uses no temporary
storage, it is an in-place sort.
6.2 Sequential Quicksort:
The sequential implementation of the quicksort algorithm follows a divide and
conquer approach to sort a large input array of values. The procedure involves:
1. Selecting one of the numbers (any random number may be selected) from the
input as the pivot element.
2. Locating the index (position) of the pivot in the input array and then dividing
the array into sub-arrays: the lower sub-array contains elements with values smaller
than the pivot, and the upper sub-array contains elements with values higher than
that of the pivot element.
3. Applying step one recursively on both the lower and upper arrays.
4. Finally a sorted list of values is obtained (sorted here in ascending order).
ILLUSTRATION OF QUICKSORT
Figure 6.1: Sequential Quicksort algorithm
Quicksort is known to be the fastest comparison-based sorting algorithm in the
average case, and quicksort has some natural concurrency (sorting the lower and
upper lists concurrently). A plain recursive version is sketched below.
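For reference, a plain recursive quicksort in C following the steps above is sketched here (illustrative only; it is not the benchmarked sequential code, and it takes the last element as the pivot).

static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

static int partition(int A[], int low, int high)
{
    int pivot = A[high];                 /* step 1: choose a pivot           */
    int i = low - 1;
    for (int j = low; j < high; j++)     /* step 2: split around the pivot   */
        if (A[j] < pivot)
            swap_int(&A[++i], &A[j]);
    swap_int(&A[i + 1], &A[high]);
    return i + 1;                        /* final position of the pivot      */
}

void quicksort(int A[], int low, int high)
{
    if (low < high)
    {
        int p = partition(A, low, high);
        quicksort(A, low, p - 1);        /* step 3: recurse on the lower part */
        quicksort(A, p + 1, high);       /*         and on the upper part     */
    }
}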
6.3 Parallel Quicksort Using Regular Sampling:
Parallel quicksort using regular sampling can be applied to very large sets of data.
It basically involves segmenting the unsorted list into blocks; the unsorted list is
evenly distributed among the blocks. In all, four phases are involved:
1. Each block individually sorts its segment, selecting the data items at local indices
0, n/p^2, 2n/p^2, ..., (p-1)n/p^2 as a regular sample of its locally sorted block.
2. All the selected pivot candidates are then sorted again, and (p-1) pivots are
selected and broadcast to every block.
3. Each block then partitions its sorted sub-array into p disjoint partitions.
4. Each block i keeps its i-th partition and sends the j-th partition to block j, for all
j != i, and then each block merges its p partitions into a single global array.
6.3.1 Implementation:
1. The input unsorted list is divided into N blocks (N = size/512), and the unsorted
partitions are then copied from the global array to the shared array of each block on
the GPU.
2. Each block sorts the segment of the list stored in its shared array, independently
of the other blocks (a sketch of this first phase is given after the list).
3. Local pivot candidates are selected and copied to a global array, indexed
according to the blockIdx.
4. The list of pivot candidates is then sorted again, and P-1 pivots are selected and
broadcast to every block.
5. The local sorted arrays are partitioned according to the pivots, and the partitions
are then merged into a global array accordingly.
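One possible form of the first phase (each block sorting its 512-element segment with a single thread and writing regularly spaced samples to a global array) is sketched below. This is an illustrative reconstruction under the assumptions that the segment size is 512 and that p divides 512; it is not the project's kernel1.

#define SEG 512

__global__ void psrs_local_sort(int *data, int *samples, int p)
{
    __shared__ int s[SEG];
    int tid  = threadIdx.x;
    int base = blockIdx.x * SEG;

    s[tid] = data[base + tid];              // stage this block's segment in shared memory
    __syncthreads();

    if (tid == 0)                           // serial insertion sort by thread 0
    {
        for (int i = 1; i < SEG; i++)
        {
            int key = s[i], j = i - 1;
            while (j >= 0 && s[j] > key) { s[j + 1] = s[j]; j--; }
            s[j + 1] = key;
        }
        for (int k = 0; k < p; k++)         // pick p regularly spaced samples
            samples[blockIdx.x * p + k] = s[k * (SEG / p)];
    }
    __syncthreads();

    data[base + tid] = s[tid];              // write the sorted segment back
}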
6.4 Kernel Specifications:
1. kernel1 - 6 registers, 6810+16 bytes smem, 4 bytes cmem
2. kernel2 - 8 registers, 24+16 bytes smem, 4 bytes cmem
3. kernel3 - 7 registers, 2084+16 bytes smem, 8 bytes cmem
6.5 Salient features:
1. The use of shared memory to perform consecutive reads, which reduces the time
that would otherwise have been spent performing the same reads and writes in
global memory.
2. The code is generalised to run on a very large number of values.
3. Better load balance.
4. Repeated communications of the same value are avoided.
5. Use of three kernel functions to increase the extent of parallelization while
continuously using shared memory.
6.6 Limitations:
1. The input size is limited to multiples of 512.
2. The sorting of the segmented array performed at block level is implemented
using a single thread, affecting the overall efficiency and reducing parallelism.
3. There is constant use of global memory for broadcasting the pivots and globally
sorting them.
4. GPU occupancy of 67% was achieved, as calculated by the GPU Occupancy
Calculator.
6.7 Observations:
1. The sequential code is highly efficient and recursive.
2. The use of three kernels drastically increases the execution time.
6.8 Conclusions:
1. Efficient sequential codes can outperform the parallel versions.
Figure 6.2: execution time vs input size
Figure 6.3: speedUp vs input size
Figure 6.4: speedUp vs input size
Chapter 7
Implementation of matrix
transpose algorithm on CUDA
7.1 Introduction
Matrix transpose is an operation in which we exchange the rows with their
corresponding columns, i.e. the values in the 1st row become the values of the 1st
column. In this implementation the transpose is computed only for a square matrix,
i.e. both dimensions of the matrix are the same. The matrix transpose can be
calculated for very large sets of input data and is generally described as below:
INPUT: Matrix A having N x N dimensions.
OUTPUT: Matrix transpose(A) having the same dimensions; the 1st row of A must
match the 1st column of transpose(A), and so on.
Example:
matrix A =
1 2 3
4 5 6
7 8 9
transpose(A) =
1 4 7
2 5 8
3 6 9
7.2 Matrix transpose proves to be advantageous in the implementation of the following:
1. Used to find the inverse of a matrix.
2. Applications of orthogonal matrices.
7.3 Sequential matrix transpose:
The logic for the sequential version is pretty straightforward: the rows and columns
are exchanged, so we basically swap the two indices, i.e.
At[j][i] = A[i][j];
where At denotes transpose(A). Thus we index our program to follow the above logic:

for (i = 0; i < r1; i = i + 1)
{
    for (j = 0; j < c1; j = j + 1)
    {
        At[j][i] = A[i][j];
    }
}

Here r1 is the number of rows of matrix A and c1 the number of columns, and we
know both must be equal as it is a square matrix.
7.4 Parallel matrix transpose:
As matrix A and transpose(A) are stored separately, we can launch as many threads
as there are elements, and we do not even have to synchronize them.
7.4.1 Implementation:
For an input matrix of size N x N (N can be very large), a 2-D grid is created. If
N*N < 512, then a grid with one block containing N*N threads is launched by the
kernel function. If N*N > 512, then the number of blocks launched is (N*N)/256
and 2-D blocks, each of dimension 16 x 16, are launched. Indexing of each element
is done using threadIdx.x, threadIdx.y, blockIdx.x and blockIdx.y:
int row = blockIdx.y*block_D + threadIdx.y;
int col = blockIdx.x*block_D + threadIdx.x;
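A simple global-memory transpose kernel following this indexing is sketched below, with an assumed 16 x 16 block size; it is an illustration, not the project's exact kernel.

#define block_D 16

// Naive global-memory transpose: element (row, col) of A goes to (col, row) of At.
__global__ void matrix_transpose(const float *A, float *At, int N)
{
    int row = blockIdx.y * block_D + threadIdx.y;
    int col = blockIdx.x * block_D + threadIdx.x;
    if (row < N && col < N)
        At[col * N + row] = A[row * N + col];
}

// Example launch for an N x N matrix (host code):
//   dim3 threads(block_D, block_D);
//   dim3 grid((N + block_D - 1) / block_D, (N + block_D - 1) / block_D);
//   matrix_transpose<<<grid, threads>>>(d_A, d_At, N);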
7.5 Kernel specifications:
__global__ void matrixMul_globalmemory() - 9 registers, 28+16 bytes of smem, 4
bytes of cmem[1].
7.6 Salient features:
1. We have implemented it in global memory as our threads are independent of
each other and we face no synchronisation problem.
2. The motivation for using global memory was to run our code for matrices with
large dimensions.
3. The code is generalised to run on a very large number of values.
4. Both the times t1 (without considering memory copy overhead) and t2
(considering memory transfer overhead) are calculated.
7.7 Limitations:
1. For large arrays (>512 values), the input size was limited to multiples of 512.
2. GPU occupancy of 67% was achieved, as calculated by the GPU Occupancy
Calculator.
7.8 Observations:
1. As N increases, the ratio of calculation time to memory overhead decreases
significantly. This is due to the simple calculation logic.
2. Due to memory overhead, the speed-up did not increase beyond 0.91.
7.9 Conclusions:
1. Speed-up in calculations (CPU vs GPU kernel time) is easily achieved.
2. Better memory optimizations can give a significant speed-up.
Figure 7.1: Execution time vs input size
Figure 7.2: speedUp vs input size
Figure 7.3: speedUp vs input size
Chapter 8
Implementation of parallel sum
algorithm on CUDA
8.1 Introduction
Parallel sum is a program to find the sum of all the elements present in an array.
The parallel sum can be calculated for very large sets of input data and is generally
described as below:
INPUT: a set of N values [a1, a2, a3, ..., an-1, an]
OUTPUT: the final sum of the array, SUM = a1 + a2 + a3 + ... + an-1 + an
For example, a[8] = {1,3,4,2,6,3,7,1}
SUM = 1+3+4+2+6+3+7+1 = 27
8.2 Parallel sum proves to be advantageous in the implementation of the following:
1. In finding the mean of a set of values.
2. In finding the variance.
8.3 Sequential Sum Algorithm:
The sequential sum algorithm is a very simple method to calculate the total sum of
a given input array of numbers: just loop through the array, adding the current
value to the variable sum. The logic is demonstrated below:

SUM = 0;
for (i = 0; i < size; i = i + 1)
    SUM = a[i] + SUM;

This code performs exactly N adds for an array of size N and thus is a very simple
implementation.
8.4 Parallel Sum on CUDA:
The parallel sum algorithm can be performed very efficiently using the parallel
architecture. We assume the size of the input array to be a power of two, i.e. 2, 4,
16, 32, ..., 1024, ..., 8192, and so on.
8.4.1 Implementation:
For an input array of size N (which can be very large), a single-dimension grid is
created with N/512 blocks. If the input size is N < 512, then a grid with one block
containing N threads is launched by the kernel function.
In the kernel function each thread performs the sum of two elements and stores that
sum at the lower of the two indices. For example, if we have an input array
A = {1,2,3,4,5,6,7,8}, then in the first step, for 8 values, we create 4 threads. The
first thread (threadIdx = 0) computes a[0] = a[0] + a[1] = 1 + 2 = 3 and stores it at
the lower index, i.e. 0; similarly the second thread (threadIdx = 1) computes
a[2] = a[2] + a[3] = 3 + 4 = 7, the third thread (threadIdx = 2) computes
a[4] = a[4] + a[5] = 5 + 6 = 11, and the fourth thread (threadIdx = 3) computes
a[6] = a[6] + a[7] = 7 + 8 = 15. The number of partial sums has now reduced from
8 to 4, so we require only 2 threads instead of 4. This is controlled using the thread
IDs with the condition if ((int)threadIdx - power(j) >= 0), where j denotes the step
number, i.e. it is 0 for the first step, 1 for the second step, and so on. As we observe,
the number of values reduces by a factor of 2 each time; thus, to compute the sum
of N values we need log2(N) steps.
Each block is provided with a shared array of size 512 and its own set of shared
variables. All the values of the input array, which are stored in global memory, are
mapped to a specific thread ID that depends on the number of blocks:
ID = blockIdx.x*dim_block + threadIdx.x;
Proper synchronisation must be ensured between the different steps of the threads.
We have used the standard CUDA library function __syncthreads().
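The per-block summation described above can be sketched as follows. This is an illustrative reconstruction using the common stride-halving variant of the pairwise scheme (rather than the exact thread-deactivation condition quoted above); it assumes 512 threads per block and a power-of-two block size.

__global__ void block_sum(const int *in, int *block_sums, int n)
{
    __shared__ int s[512];
    int tid = threadIdx.x;
    int id  = blockIdx.x * blockDim.x + tid;

    s[tid] = (id < n) ? in[id] : 0;         // copy this block's slice into shared memory
    __syncthreads();

    // Pairwise reduction: in each step the lower half of the threads adds the upper half.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        block_sums[blockIdx.x] = s[0];      // partial sum of this block
}

The partial sums in block_sums are then added together (on the host, or by running the kernel again on the block_sums array) to obtain the final SUM.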
8.5 Kernel Specification:-
1. __global__ void sum() - 5 registers, 2076+16 bytes of smem, 8 bytes of cmem[1].
8.6 Salient Features:-
1. The use of shared memory to perform consecutive reads, which reduces the time
that would otherwise have been spent performing the same reads and writes in
global memory.
2. Proper synchronization between threads operating in parallel inside a block.
3. The code is generalised to run on a very large number of values.
4. Both the times t1 (without considering memory copy overhead) and t2
(considering memory transfer overhead) are calculated.
8.7 Limitations:-
1. We have assumed that the number of input values must be a power of 2.
2. We can run it for large inputs until the limit on the maximum number of blocks
is reached; i.e. we can have at most 65536 blocks, so we can compute the parallel
sum of an array of up to 65536*512 = 33554432 elements.
3. GPU occupancy of 67% was achieved, as calculated by the GPU Occupancy
Calculator.
8.8 Observations:
1. For very small input sizes, the sequential sum appears to be much faster than the
parallel code.
2. Good speed-ups with respect to kernel execution times are achieved, which
demonstrates the efficiency of running the parallel code on CUDA.
8.9 Conclusions:
1. Use of shared memory requires careful synchronization logic.
2. Bank conflicts are very common due to unrestricted access of shared memory.
Figure 8.1: Execution time vs input size
Figure 8.2: speedUp vs input size
Figure 8.3: speedUp vs input size
Chapter 9
Calculation Of Variance and
Standard Deviations on CUDA
9.1 Introduction
The mean of a data set is simply the arithmetic average of the values in the set,
obtained by summing the values and dividing by the number of values. The mean is
a measure of the center of the distribution. The variance is used as a measure of
how far a set of numbers is spread out. It gives a measure of how far the numbers
lie from their mean. The variance of a data set is the arithmetic average of the
squared differences between the values and the mean.
Standard deviation gives a measure of how much variation or dispersion there is
from the mean. Mathematically it is the square root of the variance. The variance
and the standard deviation are both measures of the spread of the distribution
about the mean.
9.2 Finding variance and standard deviation proves to be advantageous when:
1. The spread of the data around the mean is to be found.
2. Large data is to be analyzed on the basis of the extent of the spread in the data.
3. For example, the margin of error in polling data is determined by calculating the
standard deviation of the results if the polling were to be done multiple times.
9.3 Sequentially Calculate Variance and SD:
The sum is easily calculated by adding each element of the N-sized array, and the
mean is found by dividing this sum by N:

for (i = 0; i < n; i = i + 1)
{
    sum = sum + A[i];
}
avrg = sum/n;

The variance is then calculated from the deviations about this mean value, using
the formula stated above. The loop would be:

for (i = 0; i < n; i++)
{
    sum1 += (A[i] - avrg)*(A[i] - avrg);
}
var = sum1/n;
SD = sqrt(var);

Here SD is the standard deviation, which is the square root of the variance.
9.4 Parallel Calculation of Variance and SD:
The process of finding the sum in parallel on CUDA is a complex one due to
synchronization problems. The sum is calculated using the kernel described in
Chapter 3. The average is obtained by dividing the sum by N, and this is used by
the second kernel for the calculation of the variance and the SD.
9.4.1 Implementation:
For an input array of size N (which can be very large), a single-dimension grid is
created with N/512 blocks. If the input size is N < 512, then a grid with one block
containing N threads is launched by the kernel function. Each block is provided
with a shared array of size 512 and its own set of shared variables. All the values of
the input array, which are stored in global memory, are mapped to a specific thread
ID that depends on the number of blocks:
ID = blockIdx.x*dim_block + threadIdx.x;
Thus, the respective elements are copied from the global memory to the shared
memory of each block. The average calculated by kernel 1 is passed on to kernel 2,
and the per-block contributions to the variance are calculated and stored in an
array. Their summation gives the variance of the data, and the square root of the
variance gives the SD.
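A sketch of the second kernel, which accumulates the squared deviations of each block given the precomputed mean, is shown below; it is an illustrative reconstruction, not the project's exact kernel.

__global__ void block_squared_deviation(const float *in, float *block_var,
                                        float mean, int n)
{
    __shared__ float s[512];
    int tid = threadIdx.x;
    int id  = blockIdx.x * blockDim.x + tid;

    float d = (id < n) ? (in[id] - mean) : 0.0f;
    s[tid] = d * d;                         // squared deviation of this element
    __syncthreads();

    // Pairwise reduction of the squared deviations in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        block_var[blockIdx.x] = s[0];       // summed over blocks, then divided by n; SD = sqrt(var)
}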
9.5 Kernel Specification:
1. __global__ void sum() - 5 registers, 2076+16 bytes of smem, 8 bytes of cmem[1].
9.6 Limitations:
1. For large arrays (>512 values), the input size was limited to multiples of 512.
2. GPU occupancy of 67% was achieved, as calculated by the GPU Occupancy
Calculator.
9.7 Observations:-
1. For very small input sizes, the sequential version appears to be much faster than
the parallel code.
2. With the increase in size of the input, the time taken by the sequential code
increases almost linearly, whereas the time taken by the kernel to execute remains
roughly constant; the overall performance of the parallel code, however, is degraded
by the time accounted for memory copy overhead between host and device.
3. Very large speed-ups with respect to kernel execution times are achieved, which
demonstrates the efficiency of running the parallel code on CUDA, but the memory
overhead for large values limits the overall speed-up.
9.8 Conclusions:
1. Finding the mean, variance and SD sequentially is O(n); hence no speed-up is
achieved, as the kernel for finding the sum has synchronization requirements to be
met.
2. Memory optimization techniques can be used to control access to shared memory,
and some speed-up may be achieved, but it is not guaranteed.
Figure 9.1: Execution time vs input size
Figure 9.2: speedUp vs input size
Figure 9.3: speedUp vs input size
Chapter 10
Data of Algorithms
The CPU we used has the following specifications:
Processor : Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz
Memory : 1GB DDR2 RAM
L2 Cache : 4 MB
The Nvidia Quadro FX 1700 GPGPU we used has the following specifica-
tions:
CUDA Parallel Processor Cores : 32
Memory Size : 512 MB
Memory Interface : 128-bit
Graphics Memory Bandwidth : 12.8 GB/sec
The graphics card used for our experiments (Quadro FX 1700) is of compute
capability 1.1. This version does not support double precision floating point. Also,
the mathematical functions used are not fully accurate. This leads to a mild loss of
accuracy in the final results.
Input    SeqEx-time    PEx-time1    PEx-time2    Speed-up1    Speed-up2
4        1             43           67           21           2
8        554           924          1009         0.17         0.15
16       2360          2118         2250         0.60         0.55
32       9414          8160         8405         1.11         1.05
64       37784         32041        32486        1.18         1.01
128      131292        133058       133952       0.99         0.98
256      538807        526462       528415       1.02         1.02
512      2378744       2118810      2122760      1.12         1.12
1024     11560038      8538991      8547882      1.35         1.35
2048     52087845      34331100     34357273     1.52         1.52
Table 10.1: Matrix Multiplication (time in 10^-6 s)
Input    SeqEx-time    PEx-time1    PEx-time2    Speed-up1    Speed-up2
4        1             51           190          0.02         0.01
8        2             61           200          0.03         0.01
16       6             77           226          0.08         0.03
32       13            94           243          0.14         0.05
64       33            120          280          0.28         0.12
128      77            147          297          0.52         0.26
256      179           182          332          0.98         0.54
512      423           251          402          1.69         1.05
Table 10.2: Bitonic Sort Algorithm (time in 10^-6 s)
Input     SeqEx-time    PEx-time1    PEx-time2    Speed-up1    Speed-up2
16        1             76           97           0.01         0.01
32        1             68           93           0.01         0.01
64        2             75           98           0.03         0.02
128       3             81           105          0.04         0.03
256       4             98           123          0.04         0.03
512       7             146          172          0.05         0.04
1024      14            151          179          0.09         0.08
2048      28            151          179          0.19         0.16
4096      54            250          301          0.22         0.18
8192      108           467          553          0.23         0.20
16384     215           934          1075         0.23         0.2
32768     430           1958         2266         0.22         0.19
65536     858           4503         5087         0.19         0.17
262144    2956          33192        35403        0.09         0.08
524288    5922          107130       111562       0.06         0.05
Table 10.3: Prefix Sum (time in 10^-6 s)
Input    SeqEx-time    PEx-time1    PEx-time2    Speed-up1    Speed-up2
4        1             43           67           0.02         0.01
8        1             47           73           0.02         0.01
16       3             50           73           0.06         0.04
32       9             58           83           0.16         0.11
64       32            78           105          0.41         0.30
128      113           144          67           0.96         0.78
256      446           282          67           1.75         1.58
512      1786          838          67           2.21         2.13
Table 10.4: Odd-Even Transposition Sort (time in 10^-6 s)
Input     SeqEx-time    PEx-time1    PEx-time2    Speed-up1    Speed-up2
4         1             63           87           0.02         0.01
16        2             177          201          0.01         0.01
32        3             549          574          0.01         0.01
64        7             1023         1030         0.01         0.01
256       32            2000         2018         0.02         0.02
512       68            2608         2646         0.03         0.03
1024      144           4608         4698         0.03         0.03
2048      290           7500         7568         0.04         0.04
8192      1252          17062        17124        0.07         0.07
32768     5392          29865        29936        0.18         0.18
131072    23079         92452        92498        0.25         0.25
Table 10.5: Quicksort (time in 10^-6 s)
Input    SeqEx-time    PEx-time1    PEx-time2    Speed-up1    Speed-up2
4        1             45           71           0.02         0.01
8        1             45           72           0.02         0.01
16       3             46           74           0.07         0.04
32       8             47           79           0.17         0.10
64       33            58           113          0.57         0.29
128      127           147          298          0.86         0.43
256      458           411          1045         1.11         0.44
512      2057          1528         3748         1.35         0.55
1024     9906          6076         13738        1.36         0.72
2048     45757         24740        50454        1.85         0.91
4096     202233        206395       307262       0.98         0.82
Table 10.6: Matrix-transpose (time in 10^-6 s)
Input        SeqEx-time    PEx-time1    PEx-time2    Speed-up1    Speed-up2
16           1             53           80           0.02         0.01
64           1             56           79           0.02         0.01
256          2             70           98           0.03         0.02
512          3             87           113          0.03         0.03
1024         6             89           117          0.07         0.05
4096         22            136          190          0.16         0.12
8192         41            227          312          0.18         0.13
16384        83            418          572          0.20         0.15
32768        168           768          1102         0.22         0.15
262144       1155          5829         8045         0.20         0.14
1048576      4650          23189        30856        0.20         0.15
4194304      18459         92547        118429       0.20         0.16
16777216     73597         369930       470787       0.20         0.16
Table 10.7: Summation Algorithm (time in 10^-6 s)
Input        SeqEx-time    PEx-time1    PEx-time2    Speed-up1    Speed-up2
512          7             272          294          0.03         0.02
1024         13            273          297          0.05         0.04
2048         25            281          311          0.09         0.08
4096         49            380          418          0.13         0.12
8192         98            582          639          0.17         0.15
16384        166           997          1090         0.17         0.15
32768        384           1767         1976         0.22         0.19
65536        767           3415         3813         0.22         0.20
131072       1323          6666         7427         0.20         0.18
262144       2639          13148        14643        0.20         0.18
1048576      10651         51131        55262        0.21         0.19
4194304      42947         200280       214462       0.21         0.20
16777216     171841        799691       854827       0.21         0.20
Table 10.8: Variance and SD (time in 10^-6 s)