
ECSE 420 - Fast Cholesky: a serial, OpenMP and MPI comparative study

Renaud Jacques-Dagenais, Bénédicte Leonard-Cannon, Payom Meshgin, Samantha Rouphael

I. INTRODUCTION

The aim of this project is to analyze different parallel implementations of the Cholesky decomposition algorithm to determine which is most efficient in executing the operation. In particular, we will study a shared-memory implementation in OpenMP as well as a message-passing implementation using MPI, all programmed in the C programming language. These parallel versions will also be compared against a plain, serial version of the algorithm.

To test our implementations of Cholesky decomposition, we employ a series of tests to gauge the performance of the programs under varying matrix sizes and numbers of parallel threads.

II. MOTIVATION

Many mathematical computations involve solving large sets of linear equations. In general, the LU decomposition method applies to any square matrix. However, for certain common classes of matrices, other methods are more efficient. One such method, commonly used to solve these problems, is Cholesky decomposition. This method is notably used in engineering applications such as circuit simulation and finite element analysis.

Sadly, Cholesky decomposition is relatively demanding in terms of computational operations. Indeed, the computation time of the algorithm scales quite poorly with respect to matrix size, with a complexity of order O(n^3) [Gallivan]. Luckily, much of the work can be parallelized, allowing for higher performance in less time.

III. BACKGROUND THEORY

The Cholesky decomposition algorithm is a special version of the LU decomposition, where the matrix to be factorized is square, symmetric and positive definite. The key difference between LU and Cholesky decomposition is that the upper triangular factor determined using the Cholesky decomposition method is forced to be equal to the transpose of the lower triangular matrix, rather than being an arbitrary matrix. For LL^T-factorable matrices, i.e. matrices that can be represented as the product of a lower triangular matrix L and its transpose, the Cholesky algorithm has twice the efficiency of the LU decomposition method.

The algorithm is iterative and recursive, such that after each major step in the algorithm, the first column and row of the L factor are known and the algorithm is applied to the remainder of the matrix. Hence, the computational domain of the algorithm successively reduces in size.

We begin with an n x n, symmetric and positive definite matrix A. These properties ensure the matrix A is LL^T-factorable [Heath]. Given the linear system Ax = b, Cholesky factorization decomposes the matrix A into the form A = LL^T, such that L is a lower triangular matrix and L^T is its transpose. Once L is computed, the original system can be solved trivially by solving the matrix equation Ly = b (also called forward substitution), followed by solving the equation L^T x = y (also called backward substitution). For the scope of this project, we will strictly focus on parallelizing the factorization algorithm rather than the entire task of solving Ax = b.

IV. IMPLEMENTATION

A. Serial Cholesky Algorithm

There are three popular implementations of the Cholesky decomposition algorithm, each based on which elements of the matrix are computed first within the innermost for loop of the algorithm [Gallivan]:

• Row-wise Cholesky (row-Cholesky)
• Column-wise Cholesky (column-Cholesky)
• Block-matrix Cholesky (block Cholesky)

For the serial implementation of the algorithm, the column-wise version was used to limit the number of cache misses on the system. The column-wise algorithm is shown below:

for j = 1 to n do
    for k = 1 to j-1 do
        for i = j to n do
            a_ij = a_ij - a_ik * a_jk
        end
    end
    a_jj = sqrt(a_jj)
    for i = j+1 to n do
        a_ij = a_ij / a_jj
    end
end

Algorithm 1: Column-wise Cholesky algorithm
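For reference, here is a direct C rendering of Algorithm 1, operating in place on the lower triangle of a 0-indexed array; this is only a minimal sketch (the function name chol_column is ours) and the full serial listing cholSerial.c appears in the appendix.

#include <math.h>

/* Minimal sketch of Algorithm 1: in-place column-wise Cholesky on the
 * lower triangle of an n x n array a (0-indexed). Assumes a is symmetric
 * positive definite; the strict upper triangle is left untouched. */
void chol_column(double **a, int n)
{
    for (int j = 0; j < n; j++) {
        for (int k = 0; k < j; k++)        /* subtract contributions of previous columns */
            for (int i = j; i < n; i++)
                a[i][j] -= a[i][k] * a[j][k];

        a[j][j] = sqrt(a[j][j]);           /* diagonal element */

        for (int i = j + 1; i < n; i++)    /* scale the rest of the column */
            a[i][j] /= a[j][j];
    }
}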

B. MPI

In order to have an efficient MPI implementation of Cholesky decomposition, it is necessary to distribute computations in a way that minimizes communication between processes. Before we can determine the best distribution, we have to identify the dependencies. By analyzing the algorithm, we notice two things: previous columns need to be updated before any subsequent column, and, when updating entries of the same column, the new value of the diagonal entry has to be computed first.

This is easier to understand with an example. Suppose we have a 5x5 matrix and we wish to update the entries in column 3 (entries in previous columns have already been updated). The dependencies for each entry in column 3 are then the following (Figure 1).

Fig. 1. Dependency chart for the message passing implementation

Notice that the values needed to update an entry are either located on the same row as the entry or on the row that contains the entry on the diagonal. Hence, decomposing the domain into rows is probably the best solution to minimize communication between processes.

Cyclic assignment of rows to processes can easily be achieved by taking the remainder of the division of the column number by the total number of processes (i.e. j % nprocessors). Note that we can take the column number instead of the row number because the matrices on which the algorithm is applied are always square. For instance, if we take the same 5x5 matrix as above and run the algorithm with 3 processes, rows would be assigned as follows:
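For illustration, here is a minimal sketch of that cyclic mapping (the helper name row_owner is ours, not part of the project code; rows and processes are 0-indexed as in the listings). For the 5x5 example with 3 processes, it assigns rows 0 and 3 to process 0, rows 1 and 4 to process 1, and row 2 to process 2.

/* Cyclic (round-robin) assignment of row i to one of nprocs processes.
 * row_owner is a hypothetical helper illustrating the j % nprocessors rule. */
int row_owner(int i, int nprocs)
{
    return i % nprocs;
}
/* Example: 5 rows over 3 processes -> owners 0, 1, 2, 0, 1 */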

Next, we need to identify which parts of the algorithm can be parallelized. Let us recapitulate how Cholesky decomposition works. For every column:

1) Replace the entries above the diagonal with zeroes
2) Update the entry on the diagonal
3) Update the entries below the diagonal

The first step is pointless to parallelize because it involves no computations.

The second step could be parallelized. We could start by broadcasting the row that contains the entry on the diagonal, then use MPI_Reduce() to compute the new value of the entry on the diagonal, and broadcast again to send the updated value to all processes. However, this would be terribly inefficient because processes would spend a lot of time sending and receiving messages. It is more efficient to simply let the process assigned to that row compute the new value of the entry on the diagonal and broadcast the whole row once.

The third step is where we gain performance by parallelizing tasks. Since processes already have access to the data in their respective row(s), we only need to broadcast the row that contains the entry on the diagonal. As soon as a process receives this data, it can proceed and update the entry (or entries) corresponding to its row(s). Thus, at this point in the program, all processes compute in parallel.
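This per-column communication step is sketched below, following the cholMPI.c listing in the appendix (the wrapper name update_and_broadcast_row is ours; row j is assumed to be stored as the contiguous array L[j] of length n):

#include <math.h>
#include <mpi.h>

/* Sketch of the per-column step: only the owner of row j updates the
 * diagonal entry, then the whole row is broadcast once so that every
 * process can update the entries of column j in its own rows. */
static void update_and_broadcast_row(double **L, int n, int j,
                                     int rank, int npes)
{
    int owner = j % npes;       /* cyclic row-to-process assignment */
    if (rank == owner) {
        for (int k = 0; k < j; k++)
            L[j][j] -= L[j][k] * L[j][k];
        L[j][j] = sqrt(L[j][j]);
    }
    /* One broadcast of n doubles per column, rooted at the owner. */
    MPI_Bcast(L[j], n, MPI_DOUBLE, owner, MPI_COMM_WORLD);
}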

We also tried broadcasting each individual entry up to the diagonal. We thought this might yield better performance because the total amount of data transferred is almost halved. However, it turned out to decrease performance by a factor of almost 2! This is due to the communication overhead incurred when transferring data between processes.

Note: the reason why we chose to parallelize the row-Cholesky algorithm instead of the column-Cholesky algorithm is that there is no simple way to broadcast the columns of a matrix. We thought about transposing the matrix before applying the algorithm, but we figured that performance would probably be worse than simply applying the row-Cholesky.

C. OpenMP

The OpenMP implementation of the Cholesky decomposition algorithm is fairly straightforward. Starting from the code for the serial implementation of the Cholesky algorithm, only a few lines of code were changed to parallelize the algorithm with OpenMP. Using the OpenMP #pragma omp parallel for directive before the innermost for loop splits the single thread into a certain number of shared-memory threads. Moreover, a set of synchronization structures (i.e. locks) was explicitly defined in the code to avoid any potential race conditions. The result of these modifications is shown in the listings section of the report; the essential change is sketched below.
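A simplified fragment of the cholOMP.c listing (the wrapper name update_column_omp is ours): once the diagonal entry L[j][j] of column j is known, every row i > j can be updated independently, so that loop is the one split among the threads.

#include <omp.h>

/* Sketch of the parallelized column update: the iterations over i are
 * independent, so the loop is distributed among the OpenMP threads. */
static void update_column_omp(double **L, int n, int j)
{
    int i, k;
    #pragma omp parallel for shared(L) private(i, k)
    for (i = j + 1; i < n; i++) {
        for (k = 0; k < j; k++)
            L[i][j] -= L[i][k] * L[j][k];
        L[i][j] /= L[j][j];
    }
}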

V. TESTS

This section outlines the testing methodology undertaken to analyze and compare the serial, OpenMP and MPI implementations. Each set of tests is executed on two 4-core machines with the specifications displayed in the Appendix.

A. Test 1

The purpose of the first test is to verify the correctness of each of the three implementations. To ensure that each program executes the right computations, we begin by decomposing a matrix A into its Cholesky factorization LL^T. Then, we compare these two matrices, based on the fact that A should be equivalent to LL^T, using the least absolute error formula:

S = \sum_{i=1}^{n} |y_i - f(x_i)|    (1)

B. Test 2

The second test is conducted to measure the accuracy of the decomposed matrix generated by each implementation. To acquire this measure, a set of matrix operations is applied both to the original matrix A and to the decomposed matrix L, producing an experimental result x' that is compared to a theoretical result x. More precisely, we begin by generating a random matrix A and a vector x of the same length (x being the theoretical value). We then compute their product b using the following equation:

Ax = b (2)

Subsequently, A's Cholesky decomposition LL^T is generated. Using these matrices and the result b previously found, the experimental value x' is computed based on the following equation:

LL^T x' = b    (3)

Since L and L^T are triangular matrices, finding x' is straightforward. Indeed, one can replace L^T x' by y and solve for y in this system of linear equations:

Ly = b (4)


Finally, x’ can be computed in a similar way using:

L^T x' = y    (5)

The error between the theoretical and experimental vectors x and x' is computed by the least absolute error:

S = \sum_{i=1}^{n} |x_i - x'_i|    (6)
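A minimal sketch of the two triangular solves (4) and (5), following the testErrorOfLinearSystemApplication routine in tests.c (the function name solve_with_cholesky is ours; LT is the transpose of L):

/* Solve L y = b by forward substitution, then L^T x = y by back
 * substitution, given the Cholesky factor L of A. */
static void solve_with_cholesky(double **L, double **LT,
                                const double *b, double *x, double *y, int n)
{
    for (int i = 0; i < n; i++) {          /* forward substitution */
        y[i] = b[i];
        for (int j = 0; j < i; j++)
            y[i] -= L[i][j] * y[j];
        y[i] /= L[i][i];
    }
    for (int i = n - 1; i >= 0; i--) {     /* back substitution */
        x[i] = y[i];
        for (int j = i + 1; j < n; j++)
            x[i] -= LT[i][j] * x[j];
        x[i] /= LT[i][i];
    }
}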

C. Test 3

The third test aims at comparing the performance of the three implementations. To acquire this information, we measure the program execution time with respect to the matrix size and the number of threads (the latter concerning OpenMP and MPI exclusively). The serial, OpenMP and MPI programs are tested with square matrices of sizes varying from 50 to 5000, 50 to 4000 and 50 to 2000/3000 (see the discussion for more details), respectively. The OpenMP and MPI programs are executed with the number of threads ranging from 2 to 32 and from 2 to 9, respectively.

The matrix size and number of threads are passed as arguments when running the executables, while the execution time is obtained by comparing the system time before and after the Cholesky decomposition and printing this value to the command prompt.

For each program execution, a new random matrix is generated. In order to ensure consistency between different runs, each test case is conducted five times and the average runtime is logged as the measurement.
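The timing itself follows the pattern used in tests.c, reading a monotonic clock before and after the factorization; a minimal sketch (elapsed_seconds is our name for the small helper):

#include <time.h>

/* Difference between two clock_gettime(CLOCK_MONOTONIC, ...) readings,
 * expressed in seconds. */
static double elapsed_seconds(const struct timespec *begin,
                              const struct timespec *end)
{
    return ((double) end->tv_sec + 1.0e-9 * end->tv_nsec)
         - ((double) begin->tv_sec + 1.0e-9 * begin->tv_nsec);
}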

VI. RESULTS

The following section summarizes the statistics acquired after executing each of the three aforementioned tests.

TABLE II
MEASURE OF THE CORRECTNESS OF EACH ALGORITHM

              Serial   OpenMP   MPI
Correctness   Passed   Passed   Passed

The least absolute deviation between the Cholesky decomposition and the original matrix is computed with a precision of 0.00000000001 for all implementations. For the serial algorithm, and for the OpenMP and MPI versions with matrix sizes from 50 to 3000 and 1 to 9 threads, the resulting deviation is always exactly 0. Therefore, all implementations of the algorithm are correct, although it is not certain whether the parallelization of the task is optimal.

VII. DISCUSSION

In this section, we discuss and analyze our findings based on the data collected during tests 1, 2 and 3.

The results of the first test show that all three implementations that we designed are correct, as expected. Indeed, the error for each one of them is exactly 0. This result is required to conduct further testing and comparison between the different implementations, since it shows that Cholesky decomposition is properly programmed in each version.

Surprisingly, the least absolute error measured in the second test is precisely 0 for all implementations, when we expected some loss of precision in the MPI implementation.

TABLE III
BEST EXECUTION TIMES IN FUNCTION OF THE MATRIX SIZE FOR EACH IMPLEMENTATION

Matrix size   Serial execution time (s)   OpenMP best execution time (s)   MPI best execution time (s)
50            0.00043                     0.00113                          0.000229
100           0.00464                     0.00299                          0.00291
200           0.01739                     0.01047                          0.008708
300           0.07026                     0.03178                          0.023265
500           0.28296                     0.14314                          0.10016
1000          3.86278                     0.94137                          0.833415
1500          11.31204                    2.98861                          2.856436
2000          30.60744                    6.92203                          6.927265
3000          113.79156                   22.94476                         22.438851
4000          284.43736                   54.20875                         -
5000          546.4589                    -                                -

This may be explained by the fact that the test matrices were diagonally dominant, and hence their condition numbers were very low. This property ensures that the decomposition yields little error, even though we did not expect it to yield no error at all.

Fig. 2. Runtime of serial Cholesky algorithm

Using the data acquired during the third testing phase, the execution time versus the matrix size is plotted for each implementation. For the serial version, we observe that the execution time increases cubically as a function of the matrix size, as displayed in Figure 2. This result corresponds to our expectations, since Cholesky decomposition is composed of three nested for loops, resulting in a time complexity of O(n^3). Similarly, we note that the OpenMP and MPI execution times (for a fixed number of threads) increase with respect to the matrix size, but less dramatically, as displayed in Figures 3 and 4. This result illustrates that parallelizing computations improves execution time.

Second, one can observe from Figure 3 that for most of the conducted tests, the execution time of the OpenMP algorithm for a fixed matrix size decreases with an increasing number of threads until 5 threads are summoned. From 5 threads onwards, the execution time generally increases as more threads are added. However, there is a notable exception: for matrix sizes above 1000, calling 9 threads often results in a longer execution time than that of 16 or 32 threads. This might be due to a particularity of Cholesky decomposition, but is more likely a result of the underlying computer architecture.

The speedup plotted in Figure 5 illustrates the ratio of the computation time of the OpenMP implementation versus its serial counterpart.


TABLE I
EXECUTION TIME IN FUNCTION OF THE MATRIX SIZE AND NUMBER OF THREADS FOR THE OPENMP AND MPI IMPLEMENTATIONS

Fig. 3. Average OpenMP Implementation execution time

Fig. 4. Average OpenMPI Implementation execution time

As can be observed, the speedup grows linearly as the matrix size increases. Indeed, the larger the matrix size, the more OpenMP outperforms the serial algorithm, since the former completes in O(n^2) and the latter in O(n^3).

It is very interesting to discuss an intriguing problem encountered with the OpenMP implementation. Initially, a column-Cholesky algorithm was programmed and parallelized.

Fig. 5. Speedups of Parallel Implementations

After executing dozens of tests, we noted that the performance improved as the number of threads was increased up to 2000 threads, despite the program being run on a 4-core machine! We suspect that spawning an overwhelming number of threads forces the program to terminate prematurely, without necessarily returning an error message. As a result, increasing the number of threads causes prompter termination and thus smaller execution times. To address this problem, a row-Cholesky algorithm was programmed instead, which yielded the logical results discussed previously. We chose to completely change the Cholesky implementation from the column to the row version because OpenMP "may not deliver the expected level of performance" with some algorithms [Chapman].

Figure 4 shows the execution time of the MPI implementation for varying matrix sizes and numbers of threads. As can be seen, the execution time for a fixed matrix size decreases as the number of threads increases, until the number of threads equals 4 (corresponding to the number of cores on the test machine). From that point on, the execution time generally increases with the number of threads, and one can observe that this is consistent across different matrix sizes. An interesting observation can also be made from Table I regarding the execution time with more than 4 threads: when the number of threads is set to an even number, the execution time is less than with the preceding and following odd numbers.


Once again, this might be a result of the underlying computerarchitecture.

The speedup of the MPI implementation over the serial one (the ratio of the serial computation time to that of MPI) is plotted in Figure 5, and we can observe that as the matrix size increases, the speedup increases linearly. In fact, this confirms that MPI outperforms the serial implementation by a factor of the matrix size n. Interestingly, we can conclude from these results that MPI's behaviour is very similar to that of OpenMP.

As previously implied, the best OpenMP and MPI performance for most matrix sizes is achieved with 4 threads. This is due to the fact that tests were conducted on a 4-core machine; maximum performance is achieved when the number of threads matches the number of cores [JieChen]. Indeed, too few threads do not fully exploit the computer's parallelization capabilities, while too many induce significant context-switching overhead between processes [Chapman].

A. Serial/MPI/OpenMP comparison

As expected, the parallel implementations result in much higher performance than their serial counterpart; obviously, parallelizing independent computations results in much smaller runtimes than executing them serially. Surprisingly, in their best cases, the OpenMP and MPI algorithms complete the Cholesky decomposition in nearly identical runtimes, with MPI having an advantage of a few microseconds, as seen in Figure 6. We were expecting OpenMP to perform better, since we believed that message passing between processes would cause more overhead than using a shared memory space. However, we have not been able to explain this result on a theoretical basis.

Fig. 6. Average OpenMPI Speedup

One can observe that the OpenMP and MPI implementations were tested with maximum matrix sizes of 4000 and 3000 respectively, compared to 5000 for the serial implementation. This outcome is due to the fact that running the programs with sizes higher than these values would completely freeze the computer because of memory limitations. Therefore, the serial implementation seems to have an advantage over the parallel algorithms when the matrix size is significantly large, because it does not require additional memory for elements such as extra threads, processes or messages. Similarly, MPI was not tested with 16 and 32 threads, unlike OpenMP, because these values would freeze the computer. We believe that this behaviour is explained by the fact that MPI requires more memory for generating not only processes but also messages, while OpenMP only requires threads.

Finally, it is essential to mention that it is much simpler to parallelize a program with OpenMP than with MPI [Mallon]; the former only requires a few extra lines of code defining the parallelizable sections and the required synchronization mechanisms (locks, barriers, etc.), while the latter requires the code to be restructured in its entirety before setting up the message-passing architecture.

VIII. CONCLUSION

Throughout this project, we evaluated the performance impact of using different parallelization techniques for the Cholesky decomposition. First, a serial implementation was developed and used as the reference base. Second, we used multicore programming tools, namely OpenMP and MPI, to apply different parallelization approaches. We inspected the results of the different implementations by varying the matrix size (between 50 and 5000) and the number of threads used (between 1 and 32). As expected, both parallel implementations result in higher performance than the serial version. Also, we observed that MPI achieved execution times a few microseconds better than those of OpenMP.

In the future, we would like to experiment with a mixed OpenMP/MPI mode in the hope of discovering an even more efficient parallelization scheme for Cholesky decomposition, since such a scheme may increase the code performance. Moreover, we would like to conduct further tests with the current implementations by running the different programs on computers with various numbers of CPUs, such as 8, 16, 32 and 64. We would expect machines with more cores to provide better execution times.

This project improved our knowledge of the different tools that can be used to parallelize a program. In fact, we were able to apply the parallel computing theories learned in class.

IX. REFERENCES

• D. Mallon, et al., "Performance Evaluation of MPI, UPC and OpenMP on Multicore Architectures," in Recent Advances in Parallel Virtual Machine and Message Passing Interface, vol. 5759, M. Ropo, et al., Eds. Springer Berlin Heidelberg, 2009, pp. 174-184.

• J. Shen and A. L. Varbanescu, "A Detailed Performance Analysis of the OpenMP Rodinia Benchmark," Technical Report PDS-2011-011, 2011.

• B. Chapman, et al., Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press, 2008.

• M. T. Heath, Parallel Numerical Algorithms course, Chapter 7 - Cholesky Factorization. University of Illinois at Urbana-Champaign.

• K. A. Gallivan, et al., Parallel Algorithms for Matrix Computations. Society for Industrial and Applied Mathematics, 1990.

• L. Smith and M. Bull, "Development of mixed mode MPI/OpenMP applications," Sci. Program., vol. 9, pp. 83-98, 2001.


X. APPENDIX

A. Test Machine Specifications

1) Machine 1
   • Hardware:
     ◦ CPU: Intel Core i3 M350 Quad-Core @ 2.27 GHz
     ◦ Memory: 3.7 GB
   • Software:
     ◦ Ubuntu 12.10 64-bit
     ◦ Linux kernel 3.5.0-17-generic
     ◦ GNOME 3.6.0

2) Machine 2
   • Hardware:
     ◦ CPU: Intel Core i7 Q 720 Quad-Core @ 1.60 GHz
     ◦ Memory: 4.00 GB
   • Software:
     ◦ Ubuntu 13.04 64-bit
     ◦ Linux kernel 3.5.0-17-generic
     ◦ GNOME 3.6.0

XI. CODE

A. matrix.c - Helper Functions

#include "matrix.h"

// Print a square matrix.
void print(double **matrix, int matrixSize){
    int i, j;
    for (i = 0; i < matrixSize; i++) {
        for (j = 0; j < matrixSize; j++) {
            printf(" %.2f\t", matrix[i][j]);
        }
        printf("\n");
    }
    printf("\n");
}

// Multiply two square matrices of the same size.
double **matrixMultiply(double **matrix1, double **matrix2, int matrixSize){
    // Allocate memory for a matrix of doubles.
    int i, j, k;
    double **matrixOut = (double **) malloc(matrixSize * sizeof(double *));
    for (i = 0; i < matrixSize; i++){
        matrixOut[i] = (double *) malloc(matrixSize * sizeof(double));
    }

    double result = 0;
    // Fill each cell of the output matrix.
    for (i = 0; i < matrixSize; i++){
        for (j = 0; j < matrixSize; j++){
            // Multiply each row of matrix1 with each column of matrix2.
            for (k = 0; k < matrixSize; k++){
                result += matrix1[i][k] * matrix2[k][j];
            }
            matrixOut[i][j] = result;
            result = 0; // Reset.
        }
    }

    return matrixOut;
}

// Add two square matrices of the same size.
double **matrixAddition(double **matrix1, double **matrix2, int matrixSize){
    // Allocate memory for a matrix of doubles.
    int i, j;
    double **matrixOut = (double **) malloc(matrixSize * sizeof(double *));
    for (i = 0; i < matrixSize; i++){
        matrixOut[i] = (double *) malloc(matrixSize * sizeof(double));
    }

    // Fill each cell of the output matrix.
    for (i = 0; i < matrixSize; i++){
        for (j = 0; j < matrixSize; j++){
            matrixOut[i][j] = matrix1[i][j] + matrix2[i][j];
        }
    }

    return matrixOut;
}

// Multiply a square matrix by a vector. Return NULL on failure.
double *vectorMultiply(double **matrix, double *vector, int matrixSize, int vectorSize){
    double *result = (double *) malloc(matrixSize * sizeof(double));

    if (vectorSize != matrixSize){
        return NULL;
    }

    int i, j;
    double sum = 0.0;
    // Multiplication.
    for (i = 0; i < matrixSize; i++){
        for (j = 0; j < matrixSize; j++){
            sum += matrix[i][j] * vector[j];
        }
        result[i] = sum;
        sum = 0; // Reset.
    }

    return result;
}

// Return the transpose of a square matrix.
double **transpose(double **matrix, int matrixSize){
    // Allocate memory for a matrix of doubles.
    int i, j;
    double **matrixOut = (double **) malloc(matrixSize * sizeof(double *));
    for (i = 0; i < matrixSize; i++){
        matrixOut[i] = (double *) malloc(matrixSize * sizeof(double));
    }

    // Transpose the matrix.
    for (i = 0; i < matrixSize; i++){
        for (j = 0; j < matrixSize; j++){
            matrixOut[i][j] = matrix[j][i];
        }
    }

    return matrixOut;
}

// Create a real positive-definite matrix.
double **initialize(int minValue, int maxValue, int matrixSize){
    // Allocate memory for the matrices of doubles. calloc is used for the
    // identity-like matrix so that its off-diagonal entries start at zero.
    int i, j;
    double **matrix = (double **) malloc(matrixSize * sizeof(double *));
    double **identity = (double **) malloc(matrixSize * sizeof(double *));
    for (i = 0; i < matrixSize; i++){
        matrix[i] = (double *) malloc(matrixSize * sizeof(double));
        identity[i] = (double *) calloc(matrixSize, sizeof(double));
    }

    // Fill matrix with random numbers between minValue and maxValue, and
    // build a diagonal matrix scaled by maxValue * matrixSize.
    double random;
    for (i = 0; i < matrixSize; i++){
        identity[i][i] = maxValue * matrixSize;

        for (j = 0; j < matrixSize; j++){
            random = (maxValue - minValue) * ((double) rand() / (double) RAND_MAX) + minValue;
            if (random == 0.0) {
                random = 1.0; // Avoid division by 0.
            }
            matrix[i][j] = random;
        }
    }

    // Transform to positive-definite: R + R^T + (maxValue * n) * I is
    // symmetric and diagonally dominant.
    double **transposed = transpose(matrix, matrixSize);
    matrix = matrixAddition(matrix, transposed, matrixSize);
    matrix = matrixAddition(matrix, identity, matrixSize);

    return matrix;
}

// Compute the sum of absolute error between 2 vectors.
double vectorComputeSumofAbsError(double *vector1, double *vector2, int size){
    int i;
    double sumOfAbsError = 0;
    for (i = 0; i < size; i++){
        sumOfAbsError += fabs(vector2[i] - vector1[i]);
    }
    return sumOfAbsError;
}

// Compute and print the sum of absolute error between 2 matrices.
void ComputeSumOfAbsError(double **matrix1, double **matrix2, int size)
{
    int i, j;
    double sumOfAbsError = 0;
    for (i = 0; i < size; i++){
        for (j = 0; j < size; j++){
            sumOfAbsError += fabs(matrix1[i][j] - matrix2[i][j]);
        }
    }
    printf("The sum of absolute error is %10.6f\n", sumOfAbsError);
}

// Print a vector, one entry per line.
void printVector(double *vector, int size){
    int i;
    for (i = 0; i < size; i++){
        printf("\t%10.6f", vector[i]);
        printf("\n");
    }
    printf("\n");
}

// Allocate a square matrix of doubles; calloc keeps unused entries at zero.
double **initMatrix(int size){
    double **matrix = (double **) malloc(size * sizeof(double *));
    int i;
    for (i = 0; i < size; i++)
        matrix[i] = (double *) calloc(size, sizeof(double));

    return matrix;
}

// Copy the lower triangular part of source into dest.
void transCopy(double **source, double **dest, int size){
    int i, j;
    for (i = 0; i < size; i++){
        for (j = 0; j <= i; j++){
            dest[i][j] = source[i][j];
        }
    }
}

// Copy a full square matrix.
void copyMatrix(double **source, double **dest, int size){
    int i, j;
    for (i = 0; i < size; i++){
        for (j = 0; j < size; j++){
            dest[i][j] = source[i][j];
        }
    }
}


B. cholSerial.c - Serial Cholesky

#include "cholSerial.h"

double **cholSerial(double **A, int n){
    // Copy matrix A and keep only its lower triangular part.
    double **L = initMatrix(n);
    transCopy(A, L, n);

    int i, j, k;
    for (j = 0; j < n; j++){
        // Inner sum: subtract the contributions of previous columns.
        for (k = 0; k < j; k++){
            for (i = j; i < n; i++){
                L[i][j] = L[i][j] - L[i][k] * L[j][k];
            }
        }

        L[j][j] = sqrt(L[j][j]);

        // Scale the rest of the column by the diagonal element.
        for (i = j + 1; i < n; i++){
            L[i][j] = L[i][j] / L[j][j];
        }
    }

    return L;
}


C. cholOMP.c - OpenMP Cholesky

#include "matrix.h"
#include <omp.h>

double **cholOMP(double **L, int n) {
    // Warning: acts directly on the given matrix!
    int i, j, k;
    omp_lock_t writelock;
    omp_init_lock(&writelock);

    for (j = 0; j < n; j++) {

        // Replace the entries above the diagonal with zeroes.
        for (i = 0; i < j; i++){
            L[i][j] = 0;
        }

        // Update the diagonal entry; the lock serializes the shared update.
        #pragma omp parallel for shared(L) private(k)
        for (k = 0; k < j; k++) {
            omp_set_lock(&writelock);
            L[j][j] = L[j][j] - L[j][k] * L[j][k]; // Critical section.
            omp_unset_lock(&writelock);
        }

        #pragma omp single
        L[j][j] = sqrt(L[j][j]);

        // Update the entries below the diagonal in parallel.
        #pragma omp parallel for shared(L) private(i, k)
        for (i = j + 1; i < n; i++) {
            for (k = 0; k < j; k++) {
                L[i][j] = L[i][j] - L[i][k] * L[j][k];
            }
            L[i][j] = L[i][j] / L[j][j];
        }
    }

    // Destroy the lock once all columns have been processed.
    omp_destroy_lock(&writelock);

    return L;
}


D. cholMPI.c - OpenMPI Cholesky

#include <mpi.h>
#include "matrix.h"

int testBasicOutput(double **A, double **L, int n);

void cholMPI(double **A, double **L, int n, int argc, char **argv){
    // Warning: cholMPI() acts directly on the given matrix!
    int npes, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double start, end;
    MPI_Barrier(MPI_COMM_WORLD); /* Timing */
    if (rank == 0) {
        start = MPI_Wtime();

        /* // Test
        printf("A = \n");
        print(L, n); */
    }

    // For each column
    int i, j, k;
    for (j = 0; j < n; j++) {

        /*
         * Step 0:
         * Replace the entries above the diagonal with zeroes
         */
        if (rank == 0) {
            for (i = 0; i < j; i++) {
                L[i][j] = 0.0;
            }
        }

        /*
         * Step 1:
         * Update the diagonal element
         */
        if (j % npes == rank) {
            for (k = 0; k < j; k++) {
                L[j][j] = L[j][j] - L[j][k] * L[j][k];
            }
            L[j][j] = sqrt(L[j][j]);
        }

        // Broadcast the row with the new values to the other processes
        MPI_Bcast(L[j], n, MPI_DOUBLE, j % npes, MPI_COMM_WORLD);

        /*
         * Step 2:
         * Update the elements below the diagonal element
         */
        // Divide the rest of the work cyclically among the processes
        for (i = j + 1; i < n; i++) {
            if (i % npes == rank) {
                for (k = 0; k < j; k++) {
                    L[i][j] = L[i][j] - L[i][k] * L[j][k];
                }
                L[i][j] = L[i][j] / L[j][j];
            }
        }
    }

    MPI_Barrier(MPI_COMM_WORLD); /* Timing */
    if (rank == 0){
        end = MPI_Wtime();
        printf("Testing MPI implementation Output: \n");
        printf("Runtime = %lf\n", end - start);
        testBasicOutput(A, L, n);

        /* // Test
        double **LLT = matrixMultiply(L, transpose(L, n), n);
        printf("L*L^T = \n");
        print(LLT, n); */
    }

    MPI_Finalize();
}

int testBasicOutput(double **A, double **L, int n){
    double **LLT = matrixMultiply(L, transpose(L, n), n);

    int i, j;
    double precision = 0.0000001;
    for (i = 0; i < n; i++){
        for (j = 0; j < n; j++){
            if (!(fabs(LLT[i][j] - A[i][j]) < precision)){
                printf("FAILED\n");
                ComputeSumOfAbsError(A, LLT, n);
                return 0;
            }
        }
    }
    printf("PASSED\n");
    return 1;
}


E. tests.c - General Test Code

#include <stdio.h>
#include <string.h>
#include <math.h>
#include <float.h>
#include <time.h>
#include <stdlib.h>
#include <omp.h>

#include "matrix.h"

// Forward declarations of the routines used below.
double **cholSerial(double **A, int n);
double **cholOMP(double **L, int n);
int testBasicOutputOfChol(double **A, double **L, int n);

typedef int bool;
enum { false, true };

struct timespec begin = {0, 0}, end = {0, 0};
time_t start, stop;

int main(int argc, char **argv)
{
    // Generate seed
    srand(time(NULL));

    if (argc != 3)
    {
        printf("You did not feed me arguments, I will die now :( ...\n");
        printf("Usage: %s [matrix size] [number of threads]\n", argv[0]);
        return 1;
    }
    int matrixSize = atoi(argv[1]);
    int threadsNumber = atoi(argv[2]);

    printf("Test basic output for a matrix of size %d:\n", matrixSize);
    // Generate a random SPD matrix
    double **A = initialize(0, 10, matrixSize);
    /* printf("Chol matrix\n");
    print(A, matrixSize); */
    double **L = initialize(0, 10, matrixSize);

    // Test the serial program: apply serial Cholesky
    printf("Testing Serial implementation Output: \n");
    clock_gettime(CLOCK_MONOTONIC, &begin);
    L = cholSerial(A, matrixSize);
    clock_gettime(CLOCK_MONOTONIC, &end); // Get the current time.
    testBasicOutputOfChol(A, L, matrixSize);
    // Report execution time
    printf("The serial computation took %.5f seconds\n",
        ((double) end.tv_sec + 1.0e-9 * end.tv_nsec) -
        ((double) begin.tv_sec + 1.0e-9 * begin.tv_nsec));

    // Test the OpenMP program
    printf("Testing OpenMP implementation Output: \n");
    omp_set_num_threads(threadsNumber);

    copyMatrix(A, L, matrixSize);
    clock_gettime(CLOCK_MONOTONIC, &begin);
    cholOMP(L, matrixSize);
    clock_gettime(CLOCK_MONOTONIC, &end); // Get the current time.
    testBasicOutputOfChol(A, L, matrixSize);
    // Report execution time
    printf("The OpenMP computation took %.5f seconds\n",
        ((double) end.tv_sec + 1.0e-9 * end.tv_nsec) -
        ((double) begin.tv_sec + 1.0e-9 * begin.tv_nsec));

    printf("\n");
    return 0;
}

int testBasicOutputOfChol(double **A, double **L, int n){
    double **LLT = matrixMultiply(L, transpose(L, n), n);

    int i, j;
    double precision = 0.00000000001;
    for (i = 0; i < n; i++){
        for (j = 0; j < n; j++){
            if (!(fabs(LLT[i][j] - A[i][j]) < precision)){
                printf("FAILED\n"); // If it fails, show the error.
                ComputeSumOfAbsError(A, LLT, n);
                return 0;
            }
        }
    }
    printf("PASSED\n");
    return 1;
}

void testTimeforSerialChol(int n){
    printf("Test duration for serial version with matrix of size %d\n", n);
    // Generate a random SPD matrix
    double **A = initialize(0, 10, n);
    clock_t start = clock();
    // Apply Cholesky
    double **L = cholSerial(A, n);
    clock_t end = clock();
    float seconds = (float) (end - start) / CLOCKS_PER_SEC;
    printf("It took %f seconds\n", seconds);
}

void testErrorOfLinearSystemApplication(int matrixSize){
    printf("Test linear system application of Cholesky for matrix size %d:\n", matrixSize);

    double **A = initialize(0, 10, matrixSize);
    double *xTheo = (double *) malloc(matrixSize * sizeof(double));
    int index;
    for (index = 0; index < matrixSize; index++){
        xTheo[index] = rand() / (double) RAND_MAX * 10;
    }
    double *b = vectorMultiply(A, xTheo, matrixSize, matrixSize);
    // Apply Cholesky
    double **L = cholSerial(A, matrixSize);

    double *y = (double *) malloc(matrixSize * sizeof(double));
    // Forward-substitution part
    int i, j;
    for (i = 0; i < matrixSize; i++){
        y[i] = b[i];
        for (j = 0; j < i; j++){
            y[i] = y[i] - L[i][j] * y[j];
        }
        y[i] = y[i] / L[i][i];
    }

    // Back-substitution part
    double **LT = transpose(L, matrixSize);
    double *xExpr = (double *) malloc(matrixSize * sizeof(double));
    for (i = matrixSize - 1; i >= 0; i--){
        xExpr[i] = y[i];
        for (j = i + 1; j < matrixSize; j++){
            xExpr[i] = xExpr[i] - LT[i][j] * xExpr[j];
        }
        xExpr[i] = xExpr[i] / LT[i][i];
    }
    printf("x experimental is:\n");
    printVector(xExpr, matrixSize);
    printf("The sum of abs error is %10.6f\n",
        vectorComputeSumofAbsError(xTheo, xExpr, matrixSize));
}


F. testMPI.c - Test program for MPI implementation

#include "matrix.h"

int main(int argc, char **argv)
{
    // Generate seed
    srand(time(NULL));
    if (argc != 2)
    {
        printf("You did not feed me arguments, I will die now :( ...\n");
        printf("Usage: %s [matrix size]\n", argv[0]);
        return 1;
    }
    int matrixSize = atoi(argv[1]);

    // Generate a random SPD matrix
    double **A = initialize(0, 10, matrixSize);
    /* printf("Chol matrix\n");
    print(A, matrixSize); */
    double **L = initialize(0, 10, matrixSize);

    // Test the MPI program
    copyMatrix(A, L, matrixSize);
    cholMPI(A, L, matrixSize, argc, argv);
    // Warning: cholMPI() acts directly on the given matrix L

    return 0;
}