
  • OPTIMIZING ALGORITHM OF SPARSE LINEAR SYSTEMS ON GPU

    DONGXU YAN, HAIJUN CAO, XIAOSHE DONG, BAO ZHANG, XINGJUN ZHANG

    Department of Computer Science and Technology, Xi'an Jiaotong University, Xi'an 710049, China. E-mail: [email protected]

    Abstract: Linear equations with large sparse coefficient matrices

    arise in many practical scientific and engineering problems. Previous sparse-matrix algorithms for solving linear equations on a single-core CPU are highly complex and time-consuming. To address this, focusing on the Jacobi iteration algorithm, this paper first implements a sparse-matrix parallel iteration algorithm on a hybrid multi-core parallel system consisting of a CPU and a GPU; an optimization scheme is then proposed to improve performance in two ways, i.e., via the multi-level storage structure and the memory access mode of CUDA. Experimental results show that the parallel algorithm on the hybrid multi-core system gains higher performance than the original sequential Jacobi iteration algorithm on the CPU. In addition, the optimization scheme is effective and feasible.

    Keywords: Sparse linear systems; CSR; GPU; Jacobi iteration

    1. Introduction

    Algorithms for solving linear equations with large sparse matrices are widely used in various scientific and engineering fields. As the scale of the linear equations increases, the sparse-matrix algorithm requires huge computing power. However, previous sparse-matrix algorithms based on a single-core CPU are highly complex and time-consuming.

    Currently, multi-core technology, in which a single computing component integrates two or more independent processors (cores), has gained much attention. On the one hand, the multi-core CPU has matured and is widely adopted in computer systems. On the other hand, GPUs with large numbers of cores have recently made great leaps forward. The large number of computational cores in a GPU enables it to provide stronger computing power. For example, NVIDIA's Tesla C1060 provides a peak computing power of 933 Gflops, while AMD's HD4870 provides 1.2 Tflops. The strong computing power of the GPU gives a chance to build heterogeneous parallel systems composed of CPU and GPU, which take the form of a main-core-plus-coprocessor architecture. In this

    architecture, both the CPU and the GPU can focus on the work they do well, improving overall performance. At present, this hybrid parallel architecture has been extensively adopted in high-performance computers. For instance, the Tianhe-1 supercomputer combines Intel Xeon X5670 CPUs (6 cores), NVIDIA M2050 GPUs (448 cores), and FT-1000 CPUs (8 cores).

    Therefore, how to take advantage of the parallel computing paradigm to optimize previous sequential algorithms for large sparse matrices becomes a promising research problem. In this paper, we first implement a sparse-matrix parallel iteration algorithm on a hybrid multi-core parallel system consisting of a CPU and a GPU; an optimization scheme is then proposed to improve performance in two ways, i.e., the multi-level storage structure and the memory access mode of CUDA.

    The rest of this paper is organized as follows. Section 2 presents related work on sparse-matrix algorithms and hybrid parallel systems based on CPU and GPU. Section 3 introduces CUDA (Compute Unified Device Architecture). In Section 4, based on CUDA, we implement the Jacobi iteration algorithm on the GPU and then optimize it. Section 5 evaluates the performance of the proposed parallel algorithm and the optimization scheme. We conclude with a summary and discuss future work in Section 6.

    2. Related works

    For the sparse matrices arising from linear equations, a number of methods have been developed, such as the iterative method [6] and the direct method. Given the large size of sparse matrices, the main objective is to reduce sparse-matrix storage and computation time. With the development of parallel computers, large-scale scientific and engineering computing pursues higher computing speed through parallelism. Therefore, computing with sparse matrices for linear equations in parallel is gaining increasing attention. Some research works have been dedicated to adapting traditional algorithms to parallel computing. This paper mainly focuses on the Jacobi iteration algorithm for

    2011 Sixth Annual ChinaGrid Conference

    978-0-7695-4472-4/11 $26.00 © 2011 IEEE. DOI 10.1109/ChinaGrid.2011.45


    linear equations with a large sparse matrix. Meanwhile, with the introduction of Nvidia's CUDA,

    the GPU shows better programmability. Taking advantage of the GPU's parallel computing features to solve problems involving sparse-matrix linear equations has also increasingly become a hot research issue [3]. Papers [1][2] optimized the sparse matrix-vector multiplication (SpMV) algorithm on the GPU for several special sparse-matrix formats, and this optimization improved performance. In paper [10], the SpMV algorithm is further optimized; after optimization, the parallel SpMV gains a speedup of 2-8x. Solving sparse-matrix linear equations, the issue addressed in this paper, has some similarities with SpMV: both algorithms compute with a sparse matrix. Based on the above literature, this paper implements a parallel Jacobi iteration algorithm on the GPU and makes further optimizations.

    3. Introduction to CUDA

    CUDA is a solution for General-Purpose computing on Graphics Processing Units (GPGPU). CUDA provides a programming model based on ANSI C, extended with several keywords and constructs. The programmer writes a single source program that contains both the host (CPU) code and the device (GPU) code [7]. These two parts are automatically separated and compiled by the CUDA compiler tool chain.

    CUDA allows the programmer to write device code in C functions called kernels [9]. A kernel is different from a regular function in that it is executed by many GPU threads in a Single Instruction Multiple Data (SIMD) fashion. Each thread executes the entire kernel once. Launching a kernel for GPU execution is similar to calling the kernel function, except that the programmer needs to specify the space of GPU threads that execute it, called a grid. Each GPU thread is given a unique thread ID that is accessible within the kernel, through the built-in variables blockIdx and threadIdx.

    They are vectors that specify an index into the block space (which forms the grid) and the thread space (which forms a block), respectively. Each thread uses its ID to select the distinct data elements it processes. It is worth noting that blocks are required to execute independently, because the GPU does not guarantee any execution order among them. However, threads within a block can synchronize through a barrier [8].

    GPU threads have access to multiple GPU memories during kernel execution. Each thread can read and/or write its private registers and local memory (for spilled registers).

    With single-cycle access time, registers are the fastest in the GPU memory hierarchy. In contrast, local memory is the slowest in the hierarchy, with more than 200-cycle latency. Each block has its private shared memory. All threads in the block have read and write access to this shared memory, which is as fast as registers. Globally, all threads have read and write access to the global memory, and read-only access to the constant memory and the texture memory. These three memories have the same access latency as the local memory.

    Local variables in a kernel function are automatically allocated in registers (or local memory). Variables in other GPU memories must be created and managed explicitly, through the CUDA runtime API. The global, constant and texture memory are also accessible from the host. The data needed by a kernel must be transferred into these memories before it is launched. Note that these data are persistent across kernel launches. The shared memory is essentially a cache for the global memory, and it requires explicit management in the kernel. In contrast, the constant and texture memory have caches that are managed by the hardware.

    To write a CUDA program, the programmer typically starts from a sequential version and proceeds through the following steps:

    1. Identify a kernel, and package it as a separate function.

    2. Specify the grid of GPU threads that executes it, and partition the kernel computation among these threads, by using blockIdx and threadIdx inside the kernel function.

    3. Manage data transfer between the host memory and the GPU memories (global, constant and texture), before and after the kernel invocation. This includes redirecting variable accesses in the kernel to the corresponding copies allocated in the GPU memories.

    4. Perform memory optimizations in the kernel, such as utilizing the shared memory and coalescing accesses to the global memory [8, 11].

    5. Perform other optimizations in the kernel in order to achieve an optimal balance between single-thread performance and the level of parallelism [11].
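    As a CPU-side illustration of step 2 (this is not CUDA itself, and the grid and block sizes are arbitrary choices for the example), the following C sketch simulates how a grid of blocks and threads partitions a vector addition by global thread ID, with the guard that keeps extra threads from overrunning the array:

```c
#include <assert.h>

#define BLOCK_DIM 4   /* threads per block (threadIdx runs 0..3) */
#define GRID_DIM  3   /* blocks per grid  (blockIdx runs 0..2)  */

/* Body of a hypothetical vector-add kernel: each "thread" computes its
 * global ID from its block and thread indices, then handles exactly the
 * element that ID selects, guarded against running past the array end. */
static void kernel_body(int blockIdx, int threadIdx,
                        const float *a, const float *b, float *c, int n) {
    int tid = blockIdx * BLOCK_DIM + threadIdx;  /* global thread ID */
    if (tid < n)
        c[tid] = a[tid] + b[tid];
}

/* On the CPU, a nested loop over blocks and threads stands in for the
 * parallel grid launch; on the GPU all iterations would run concurrently. */
static void launch(const float *a, const float *b, float *c, int n) {
    for (int bx = 0; bx < GRID_DIM; bx++)
        for (int tx = 0; tx < BLOCK_DIM; tx++)
            kernel_body(bx, tx, a, b, c, n);
}
```

Note that the grid supplies 12 thread IDs here; for an array of 10 elements the `tid < n` guard simply idles the surplus threads, which is the standard CUDA idiom.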

    Note that a CUDA program may contain multiple kernels, in which case the procedure above needs to be applied to each of them. Most of the above steps in the procedure involve significant code changes that are tedious and error-prone, not to mention the difficulty in finding the right set of optimizations to achieve the best performance [11]. This not only increases development time, but also makes the program difficult to understand and to maintain. For example, it is non-intuitive to picture the kernel computation as a whole through explicit specification of what each thread does. Also, management and optimization


    on data in GPU memories involve heavy manipulation of array indices, which can easily go wrong, prolonging program development and debugging [12].

    For the CUDA platform, we can summarize three main features:

    1. The CUDA platform uses the SIMT (Single Instruction Multiple Thread) execution model, which differs from the conventional SIMD execution model. In SIMT, each thread has private registers, and threads can communicate with each other through memory and synchronization mechanisms.

    2. The CUDA platform provides a sophisticated memory model. Whether the various memories are exploited reasonably therefore has a significant impact on memory access speed and performance.

    3. In actual operation, the SM (streaming multiprocessor) creates, manages, schedules, and executes threads in warps. We can therefore use the warp size to improve performance.

    The GPU only plays an important role in highly data-parallel tasks. Such computations are characterized by a large amount of data to process, with the data stored as a grid or matrix, and the same processing applied to each element. Classic examples of data-parallel problems are image processing, physical model simulation (such as computational fluid dynamics), engineering and financial simulation and analysis, searching, sorting, etc.

    [Fig.1 depicts the CUDA programming model: serial host code interleaved with kernel launches (Kernel_1, Kernel_2); each kernel executes over a grid of thread blocks (e.g., Grid 1 with blocks BLOCK(0,0)..BLOCK(3,1)), and each block is a 2-D array of threads (Thread(0,0)..Thread(2,3)).]

    Fig.1 CUDA Programming Model

    4. Parallel Iterative Algorithm and optimization

    4.1. CSR storage format

    The storage format of a sparse matrix is a crucial factor that plays an important role in the performance of large-scale sparse-matrix linear equation solvers. In order to save storage space and allow easy access, several formats have emerged, such as the diagonal storage method, the bandwidth storage method, the variable-bandwidth storage method, and the ultra-coordinate matrix storage method [2].

    The CSR (Compressed Sparse Row) storage method evolved from the coordinate storage format. An example of the CSR storage format for a sparse matrix A is shown in Fig.2. Because CSR fully utilizes the characteristics of the sparse matrix, in particular providing easy element access and query, it is widely used in solving large sparse-matrix linear equations. Furthermore, a sparse matrix in CSR storage format exposes parallelism, so it is well suited to the parallel processing capability of the GPU.

        A = | 2 5 0 0 |
            | 0 3 9 0 |
            | 7 0 4 6 |
            | 0 4 0 8 |

        element = [2,5,3,9,7,4,6,4,8]
        col     = [0,1,1,2,0,2,3,1,3]
        ptr     = [0,2,4,7,9]

    Fig.2 CSR storage format
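    To make the layout concrete, the following C sketch stores the matrix of Fig.2 in the three CSR arrays and uses them for a sparse matrix-vector product, the row-wise access pattern that both SpMV and the Jacobi kernel rely on (the function name `csr_spmv` is ours, not the paper's):

```c
#include <assert.h>

/* y = A*x for an n-row matrix stored in CSR form:
 * element[] holds the nonzeros row by row,
 * col[k] is the column index of element[k],
 * ptr[i]..ptr[i+1]-1 indexes the nonzeros of row i. */
static void csr_spmv(int n, const float *element, const int *col,
                     const int *ptr, const float *x, float *y) {
    for (int i = 0; i < n; i++) {
        float sum = 0.0f;
        for (int k = ptr[i]; k < ptr[i + 1]; k++)
            sum += element[k] * x[col[k]];
        y[i] = sum;
    }
}
```

With the arrays from Fig.2 and x = (1,1,1,1), the products are the row sums of A, i.e. y = (7, 12, 17, 12).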

    4.2. Jacobi iterations for sparse matrix

    For a linear equation of low order, the direct method is very effective. However, if the order is high and the coefficient matrix is sparse, the direct method becomes difficult: since there are only a few non-zero elements in the matrix, the direct method needs to store a large number of zero elements. To reduce the computation and save memory, the iterative method was proposed and is regarded as much more favorable.

    For an n-order linear equation AX = b, A is the sparse coefficient matrix and a_{ij} is its element in row i, column j. If the coefficient matrix is non-singular and a_{ii} != 0 (i = 1, 2, ..., n), then the solution of the equations based on the Jacobi iterative method [4] can be presented as the following equation:

        x_i^{(k+1)} = (1 / a_{ii}) ( b_i - \sum_{j=1, j != i}^{n} a_{ij} x_j^{(k)} ),    i = 1, 2, ..., n

    The algorithm of Jacobi iteration is as follows:

    Input: the coefficient matrix A, the constant vector b, the error limit ε, the initial n-dimensional solution vector x.

    Output: the final solution vector x.


    Begin
        diff = ∞ ;
        while ( diff >= ε )
            diff = 0 ;
            for ( i = 1 to n )
                newx_i = b_i ;
                for ( j = 1 to n )
                    if ( j != i )
                        newx_i = newx_i - a_ij * x_j
                    end if
                end for
                newx_i = newx_i / a_ii ;
            end for
            for ( i = 1 to n )
                diff = max { diff, |newx_i - x_i| } ;
                x_i = newx_i
            end for
        end while
    End

    In the above algorithm, there are two for loops: one computes the values of newx_i, and the other checks the convergence condition of the iteration. There is no data dependence among the newx_i computations in the first loop, so they can be calculated in parallel. Because the GPU provides parallel computing in a single-instruction multiple-data fashion, we offload the loop that computes the values of newx_i to the GPU.

    4.3. Algorithm optimization and analysis

    In this paper, we take two steps to improve the performance of the Jacobi iteration algorithm.

    In the first step, based on CUDA, the Jacobi iteration algorithm is implemented on the hybrid parallel system (GPU plus CPU). In particular, the global memory is chosen to store data.

    In the second step, we apply the three features summarized in Section 3 to optimize the program from step 1. On the one hand, because the GPU memory is read and written frequently, we use the on-chip shared memory to store intermediate results and reduce memory access cycles. On the other hand, we assign the 16 threads of a half-warp to handle one row of the sparse matrix, trying to hide memory access latency. The optimization therefore both provides a reasonable memory access pattern and uses the warp size to improve performance.
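    The per-row combination that the half-warp performs is a log-step tree reduction over 16 partial sums. The following C sketch simulates that reduction on the CPU (the function and variable names are ours); on the GPU the inner loop over lanes runs as 16 concurrent threads:

```c
#include <assert.h>

#define HALFWARP 16

/* Tree reduction over one half-warp's partial sums: at each step the
 * lower half of the active lanes accumulates the upper half, halving
 * the stride until lane 0 holds the whole row's total. This takes
 * log2(16) = 4 steps instead of 15 sequential additions. */
static float halfwarp_reduce(float sumid[HALFWARP]) {
    for (int stride = HALFWARP / 2; stride > 0; stride /= 2)
        for (int lane = 0; lane < stride; lane++)
            sumid[lane] += sumid[lane + stride];
    return sumid[0];
}
```

In the Program2 kernel, `sumid` lives in shared memory, so each of these accumulation steps is a fast on-chip access rather than a trip to global memory.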

    To distinguish the programs of step 1 and step 2, Program1 and Program2 denote the versions before and after optimization, respectively.

    The following is the kernel function of Program2:

    __global__ void solve_kernel (int totalnum, int len, float *gAa, int *gJa, int *gIa, float *gCon, float *gSol, float *gas, float *gaerr)
    {
        int bx = blockIdx.x;
        int tx = threadIdx.x;
        int tid = bx * 256 + tx;
        __shared__ float sumid[256];
        float sum = 0;
        int rowid = tid / HALFWARP;
        int colid = tid & (HALFWARP - 1);
        if ( tid < SIZE * 16 && rowid < SIZE ) {
            sumid[tx] = 0;
            int begin = gIa[rowid];
            int end = gIa[rowid + 1];
            if (begin + colid < end)
                sumid[tx] = gAa[begin + colid] * gSol[tex1Dfetch(textclum1, (begin + colid))];
            __syncthreads();
            if (colid < 8) sumid[tx] += sumid[tx + 8];
            if (colid < 4) sumid[tx] += sumid[tx + 4];
            if (colid < 2) sumid[tx] += sumid[tx + 2];
            if (colid == 0) {
                sum = sumid[tx] + sumid[tx + 1];
                /* the rest of the kernel is truncated in the source; it forms
                   the Jacobi update of gas[rowid] from gCon[rowid] and sum,
                   and writes the per-row error to gaerr[rowid] */
            }
        }
    }

    5. Performance evaluation

    The sparse matrices used in the experiments are listed in Table 1; they are taken from the University of Florida Sparse Matrix Collection [5].

    Table 1. Matrix collection

    Name        Rows x Cols       Nonzeros    Avg. nonzeros/row
    tub1000     1000 x 1000          6,992     6.9
    bcsstk09    1083 x 1083         18,437    17
    bcsstk10    1086 x 1086         22,070    20
    bcsstk15    3948 x 3948        117,816    29.8
    bcsstk18    11948 x 11948      149,090    12
    bcsstk16    4884 x 4884        290,378    59.45
    bcsstk25    15439 x 15439      252,241    16.3

    First, we compare the execution time for different matrix sizes in three cases: the Jacobi iteration algorithm on the CPU (referred to as CPU), the parallel Jacobi algorithm on the GPU (Program1), and the optimized Jacobi algorithm on the GPU (Program2). As shown in Fig.3, as the input matrix size grows, the execution time of the sequential algorithm on the CPU increases sharply and is much higher than that of the two GPU algorithms. This is because in Program1 and Program2 the calculation kernel is executed in parallel on the GPU. In addition, when the input matrices are large enough, the optimized parallel algorithm Program2 achieves better performance than the parallel algorithm Program1.

    Fig.3 Jacobi iteration algorithm on CPU and GPU

    Second, we measure the speedup gained by Program1 and Program2 for various input matrix sizes. Speedup is defined as the ratio of the execution time of the sequential program on the CPU to that of the parallel program on the GPU. As shown in Fig.4, although both Program1 and Program2 are parallel algorithms executed on the GPU, Program2 outperforms Program1 in every case. This is mainly because, compared with Program1, Program2 fully exploits the memory access mode of CUDA and uses the warp size to obtain higher parallelism. We also notice that when the sparse matrix has 290,378 non-zero elements (bcsstk16), the optimized performance on the GPU is less satisfactory. Tracing back to the bcsstk16 dataset, we find that there are 59 non-zero elements on

    average in each row. That is to say, the matrix of bcsstk16 is not so sparse, and the performance of Program1 degrades in this case; furthermore, it may cause great fluctuation in Program1's memory accesses. However, because the warp size provides higher parallelism, Program2 obtains higher performance.

    Fig.4 Speedup comparison of Program1 and Program2

    6. Conclusions

    Many traditional algorithms were designed and implemented under the paradigm that the algorithm is executed sequentially by the CPU. With the development of multi-core processors, and especially the GPU, higher performance can be provided through parallelism. In this paper, we analyze the existing Jacobi iteration algorithm for solving linear equations with a large sparse matrix, then implement the parallel algorithm on the GPU and optimize it. Experimental results show that the parallel algorithm gains higher performance and that the optimization scheme is effective and feasible.

    Acknowledgements

    This work is supported by the 863 Program of China (Grant Nos. 2009AA01A135 and 2009AA01Z108) and by the Fundamental Research Funds for the Central Universities under grant 08142007.

    References:
    [1] Bell N., and Garland M. Efficient Sparse Matrix-Vector Multiplication on CUDA. NVIDIA Technical Report NVR-2008-004, Dec. 2008.
    [2] Muthu M. B., and Rajesh B. Optimizing Sparse Matrix-Vector Multiplication on CUDA. IBM Technical Report RC24704, 2008.
    [3] Zhouwei W., Xianbin X., Wuqing Z., Yuping Z., and Shuibing H. Optimizing Sparse Matrix-Vector Multiplication on CUDA. In: Proceedings of the 2nd International Conference on Education Technology and Computer (ICETC), 2010.
    [4] Yousef S. Iterative Methods for Sparse Linear Systems, 2nd Edition. SIAM, 2003.
    [5] Davis T. The University of Florida Sparse Matrix Collection. http://www.cise.ufl.edu/research/sparse/matrices/
    [6] Abhijeet G., and Ioane M. T. Parallel Iterative Linear Solvers on GPU: A Financial Engineering Case. In: Proceedings of the 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, 2010, pp. 607-614.
    [7] Owens J. D., Houston M., Luebke D., et al. GPU Computing. Proceedings of the IEEE, 2008, 96(5): 879-897.
    [8] NVIDIA Corp. CUDA Programming Guide 2.0. http://developer.download.nvidia.com/compute/cuda/2_0/docs/NVIDIA_CUDA_programming_Guide_2.0.pdf
    [9] Halfhill T. R. Parallel Processing with CUDA. Microprocessor Report, 2008.
    [10] Chao M., Gang W., Song-Wen P., and Bai-Feng W. Improvement of Sparse Matrix-Vector Multiplication on GPU. Computer Systems and Applications, 2010, 19(5): 116-120.
    [11] Shane R., Christopher I. R., Sara S. B., Sam S. S., David B. K., and Wen-mei W. H. Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA. In: Proceedings of the Symposium on Principles and Practice of Parallel Programming, 2008, pp. 73-82.
    [12] Tianyi D. H., and Tarek S. A. hiCUDA: A High-level Directive-based Language for GPU Programming. In GPGPU-2: Proceedings of the 2nd Workshop on General Purpose Processing on Graphics Processing Units, pages 52-61. ACM, 2009.