
Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization

Kurzak, Jakub and Buttari, Alfredo and Dongarra, Jack

2007

MIMS EPrint: 2007.95

Manchester Institute for Mathematical Sciences
School of Mathematics

The University of Manchester

Reports available from: http://eprints.maths.manchester.ac.uk/
And by contacting: The MIMS Secretary

School of Mathematics

The University of Manchester

Manchester, M13 9PL, UK

ISSN 1749-9097


Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization – LAPACK Working Note 184

Jakub Kurzak1, Alfredo Buttari1, Jack Dongarra1,2

1 Department of Computer Science, University of Tennessee, Knoxville, Tennessee 37996

2 Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831

May 10, 2007

ABSTRACT: The STI CELL processor introduces pioneering solutions in processor architecture. At the same time it presents new challenges for the development of numerical algorithms. One is effective exploitation of the differential between the speed of single and double precision arithmetic; the other is efficient parallelization between the short vector SIMD cores. In this work, the first challenge is addressed by utilizing a mixed-precision algorithm for the solution of a dense symmetric positive definite system of linear equations, which delivers double precision accuracy while performing the bulk of the work in single precision. The second challenge is approached by introducing much finer granularity of parallelization than has been used for other architectures and using a lightweight decentralized synchronization. The implementation of the computationally intensive sections gets within 90 percent of peak floating point performance, while the implementation of the memory intensive sections reaches within 90 percent of peak memory bandwidth. On a single CELL processor, the algorithm achieves over 170 Gflop/s when solving a symmetric positive definite system of linear equations in single precision and over 150 Gflop/s when delivering the result in double precision accuracy.
KEYWORDS: CELL BE, iterative refinement, mixed-precision algorithms, Cholesky factorization

1 Motivation

In numerical computing, there is a fundamental performance advantage to using the single precision floating point data format over the double precision data format, due to the more compact representation, thanks to which twice the number of single precision data elements can be stored at each stage of the memory hierarchy. Short vector SIMD processing provides yet more potential for performance gains from using single precision arithmetic over double precision. Since the goal is to process the entire vector in a single operation, the computation throughput can be doubled when the data representation is halved. Unfortunately, the accuracy of the computation is also reduced, since the single precision format carries roughly half the significand bits of double precision.

Most of the processor architectures available today have been at some point augmented with short vector SIMD extensions. Examples include Streaming SIMD Extensions (SSE) for the AMD and Intel lines of processors, PowerPC Velocity Engine / AltiVec / VMX, Sparc Visual Instruction Set (VIS), Alpha Motion Video Instructions (MVI), PA-RISC Multimedia Acceleration eXtensions (MAX), MIPS-3D Application Specific Extensions (ASE) and Digital Media Extensions (MDMX), and ARM NEON. The different architectures exhibit big differences in their capabilities. The vector size is either 64 bits or, more commonly, 128 bits. The register file size ranges from just a few to as many as 128 registers. Some extensions only support integer types, others also operate on single precision floating point numbers, and yet others also process double precision values.

Today the Synergistic Processing Element (SPE) of the STI CELL processor [1–3] can probably be considered the state of the art in short vector SIMD processing. Possessing 128-bit-wide registers and a fully pipelined, fused multiply-add instruction, it is capable of completing eight single precision floating point operations each clock cycle, which, combined with a register file of 128 registers, delivers close to peak performance on many common workloads. At the same time, built with multimedia and embedded applications in mind, the current incarnation of the CELL architecture does not implement double precision arithmetic on par with single precision performance, which makes the processor a very attractive target for exploring mixed-precision approaches.

Another important phenomenon in recent years has been the gradual shift of focus in processor architecture from aggressive exploitation of instruction level parallelism towards thread-level parallelism, resulting in the introduction of chips with multiple processing units commonly referred to as multi-core processors. The new architectures deliver the much desired improvement in performance, but at the same time challenge the scalability of existing algorithms and force programmers to seek more parallelism by going to much finer levels of problem granularity. In linear algebra, this enforces a departure from the model relying on parallelism encapsulated at the level of the BLAS and a shift to more flexible methods of scheduling work.

2 Related Work

Iterative refinement is a well known method for improving the solution of a linear system of equations of the form Ax = b. Typically, a dense system of linear equations is solved by applying a factorization to the coefficient matrix, followed by a back solve. Due to roundoff errors, the solution carries an error related to the condition number of the coefficient matrix. In order to improve the computed solution, an iterative refinement process can be applied, which produces a correction to the computed solution at each iteration. In principle, the algorithm can produce a solution correct to the working precision.

Iterative refinement is a fairly well understood concept and was analyzed by Wilkinson [4], Moler [5] and Stewart [6]. Higham gives error bounds for both single and double precision iterative refinement algorithms, where the entire algorithm is implemented with the same precision (single or double, respectively) [7]. He also gives error bounds in single precision arithmetic, with refinement performed in double precision arithmetic. Error analysis for the case described in this work, where the factorization is performed in single precision and the refinement in double precision, is given by Langou et al. [8].

The authors of this work have previously presented an initial implementation of the mixed-precision algorithm for the general, non-symmetric case, using LU factorization on the CELL processor [14]. Although respectable performance numbers were presented, both the factorization and the refinement steps relied on rather classic parallelization approaches. Also, a somewhat general discussion of algorithmic and implementation details was presented. This work extends the previous presentation by introducing a novel scheme for parallelization of the computational components of the algorithm, and also describes in much more detail the implementation of both the computation-intensive and the memory-intensive operations.

3 Algorithm

The standard approach to solving symmetric positive definite systems of linear equations is to use the Cholesky factorization. The Cholesky factorization of a real symmetric positive definite matrix A has the form A = LL^T, where L is a real lower triangular matrix with positive diagonal elements. The system is solved by first solving Ly = b (forward substitution) and then solving L^T x = y (backward substitution). In order to improve the accuracy of the computed solution, an iterative refinement process is applied, which produces a correction to the computed solution, x.


Algorithm 1 Solution of a symmetric positive definite system of linear equations using mixed-precision iterative refinement based on Cholesky factorization.

1:  A_(32), b_(32) ← A, b
2:  L_(32), L^T_(32) ← SPOTRF(A_(32))
3:  x^(1)_(32) ← SPOTRS(L_(32), L^T_(32), b_(32))
4:  x^(1) ← x^(1)_(32)
5:  repeat
6:      r^(i) ← b − A x^(i)
7:      r^(i)_(32) ← r^(i)
8:      z^(i)_(32) ← SPOTRS(L_(32), L^T_(32), r^(i)_(32))
9:      z^(i) ← z^(i)_(32)
10:     x^(i+1) ← x^(i) + z^(i)
11: until x^(i) is accurate enough

SPOTRF is the LAPACK name for the Cholesky factorization; SPOTRS is the LAPACK name for the symmetric back solve. The subscript (32) denotes 32-bit (single precision) representation; 64-bit representation is used in all cases where the (32) subscript is not present.

The mixed-precision iterative refinement algorithm using Cholesky factorization is outlined in Algorithm 1. The factorization A = LL^T (line 2) and the solution of the triangular systems Ly = b and L^T x = y (lines 3 and 8) are computed using single precision arithmetic. The residual calculation (line 6) and the update of the solution (line 10) are computed using double precision arithmetic and the original double precision coefficients. The most computationally expensive operations, including the factorization of the coefficient matrix A and the forward and backward substitution, are performed in single precision and take advantage of the single precision speed. The only operations executed in double precision are the residual calculation and the update of the solution.

It can be observed that all operations of O(n³) complexity are handled in single precision, and all operations performed in double precision are of at most O(n²) complexity. The coefficient matrix A is converted to single precision for the Cholesky factorization. At the same time, the original matrix in double precision is preserved for the residual calculation.

The algorithm described above, and shown in Algorithm 1, is implemented in the LAPACK library by the routine DSPOSV; its LU-based analogue, DSGESV, was introduced in LAPACK 3.1.
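For reference, Algorithm 1 can be sketched in plain C using the standard LAPACKE and CBLAS interfaces (rather than the CELL-specific kernels described in Section 4). The routine name, the convergence test on the residual norm, and the iteration limit below are illustrative assumptions, not part of the implementation presented in this paper.

/* Minimal sketch of Algorithm 1: Cholesky factorization and back solve in
 * single precision, residual and solution update in double precision. */
#include <stdlib.h>
#include <cblas.h>
#include <lapacke.h>

int mixed_precision_posv(int n, const double *A, const double *b, double *x,
                         int max_iter, double tol)
{
    float  *A32 = malloc((size_t)n * n * sizeof *A32);
    float  *w32 = malloc((size_t)n * sizeof *w32);      /* single precision RHS/residual */
    double *r   = malloc((size_t)n * sizeof *r);
    int info = -1;
    if (!A32 || !w32 || !r) goto done;

    for (int i = 0; i < n * n; i++) A32[i] = (float)A[i];   /* A(32) <- A */
    for (int i = 0; i < n; i++)     w32[i] = (float)b[i];   /* b(32) <- b */

    /* lines 2-4: SPOTRF and SPOTRS in single precision */
    info = LAPACKE_spotrf(LAPACK_ROW_MAJOR, 'L', n, A32, n);
    if (info != 0) goto done;
    LAPACKE_spotrs(LAPACK_ROW_MAJOR, 'L', n, 1, A32, n, w32, 1);
    for (int i = 0; i < n; i++) x[i] = (double)w32[i];

    for (int it = 0; it < max_iter; it++) {
        /* line 6: r <- b - A x in double precision, using the original A */
        cblas_dcopy(n, b, 1, r, 1);
        cblas_dsymv(CblasRowMajor, CblasLower, n, -1.0, A, n, x, 1, 1.0, r, 1);
        if (cblas_dnrm2(n, r, 1) <= tol) break;             /* line 11 */

        /* lines 7-10: back solve the residual in single precision, update x */
        for (int i = 0; i < n; i++) w32[i] = (float)r[i];
        LAPACKE_spotrs(LAPACK_ROW_MAJOR, 'L', n, 1, A32, n, w32, 1);
        for (int i = 0; i < n; i++) x[i] += (double)w32[i];
    }
done:
    free(A32); free(w32); free(r);
    return info;
}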

4 Implementation

4.1 Essential Hardware Features

An extensive hardware overview would be beyond the scope of this publication. Vast amounts of information are publicly available for both experienced programmers [9] and newcomers to the field [10, 11]. It is assumed that the reader has some familiarity with the architecture. Here, only the features that have the most influence on the presented implementation are mentioned.

The CELL is a multi-core chip that includes nine different processing elements. One core, the POWER Processing Element (PPE), represents a standard processor design implementing the PowerPC instruction set. The remaining eight cores, the Synergistic Processing Elements (SPEs), are short vector Single Instruction Multiple Data (SIMD) engines with big register files of 128 128-bit vector registers and 256 KB of local memory, referred to as the local store (LS).

Although standard C code can be compiled for execution on the SPEs, the SPEs do not execute scalar code efficiently. For efficient execution, the code has to be vectorized in the SIMD sense, by using C language vector extensions (intrinsics) or by using assembly code. The system's main memory is accessible to the PPE through L1 and L2 caches and to the SPEs through the DMA engines associated with them. The SPEs can only execute code residing in the local store and can only operate on data in the local store. All data has to be transferred in and out of the local store via DMA transfers.

The theoretical computing power of a single SPE is 25.6 Gflop/s in single precision and roughly 1.8 Gflop/s in double precision. Floating point arithmetic follows the IEEE format, with double precision operations complying numerically with the standard and single precision providing only rounding toward zero. The theoretical communication speed for a single SPE is 25.6 GB/s.


Figure 1: Top - steps of the left-looking Cholesky factorization. Bottom - tiling of operations: SSYRK (T = T − A A^T), SPOTRF (T = L L^T), SGEMM (C = C − B A^T), STRSM (C = C \ T^T).

The theoretical peak bandwidth of the main memory is 25.6 GB/s as well.

The size of the register file and the size of the local store dictate the size of the elementary operation subject to scheduling to the SPEs. The ratio of computing power to memory bandwidth dictates the overall problem decomposition for parallel execution.

4.2 Factorization

A few varieties of the Cholesky factorization are known, in particular the right-looking variant and the left-looking variant [12]. It has also been pointed out that those variants are borders of a continuous spectrum of possible execution paths [13].

Generally, the left-looking factorization is preferred for several reasons. During the factorization, most time is spent calculating a matrix-matrix product. In the case of the right-looking factorization, this product involves a triangular matrix. In the case of the left-looking factorization, this product only involves rectangular matrices. It is generally more efficient to implement a matrix-matrix product for rectangular matrices, and it is easier to balance the load in parallel execution. The implementation presented here is derived from the left-looking formulation of the Cholesky factorization, which follows the implementation of the LAPACK routine SPOTRF.

Due to the limited size of the local store, all numerical operations have to be decomposed into elementary operations that fit in the local store. The simplicity of implementing the Cholesky factorization lies in the fact that it can be easily decomposed into tile operations - operations on fixed-size submatrices that take from one to three tiles as input and produce one tile as output. These elementary operations will further be referred to as tile kernels. Figure 1 illustrates the steps of the left-looking Cholesky factorization and how each step is broken down into tile operations.
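For reference, the decomposition of Figure 1 can be written down with ordinary CBLAS/LAPACKE calls acting on NB×NB submatrices of a row-major matrix; this sequential sketch only illustrates the structure of the tile operations and assumes N is a multiple of NB (the implementation in this paper instead runs hand-written SPE kernels on contiguously stored tiles).

/* Tiled left-looking Cholesky factorization (lower triangle), structured as
 * in Figure 1: SSYRK and SPOTRF on the diagonal tile, SGEMM and STRSM on
 * the tiles below it. */
#include <cblas.h>
#include <lapacke.h>

int tiled_spotrf(int N, int NB, float *A)           /* A is N x N, row-major */
{
    int nt = N / NB;
    #define TILE(i, j) (A + (size_t)(i) * NB * N + (size_t)(j) * NB)

    for (int k = 0; k < nt; k++) {
        /* SSYRK: T = T - A x A^T, applied tile by tile along row k */
        for (int i = 0; i < k; i++)
            cblas_ssyrk(CblasRowMajor, CblasLower, CblasNoTrans, NB, NB,
                        -1.0f, TILE(k, i), N, 1.0f, TILE(k, k), N);

        /* SPOTRF: factorize the diagonal tile, T = L L^T */
        int info = LAPACKE_spotrf(LAPACK_ROW_MAJOR, 'L', NB, TILE(k, k), N);
        if (info != 0) return info;

        for (int j = k + 1; j < nt; j++) {
            /* SGEMM: C = C - B x A^T, using rows j and k of tiles */
            for (int i = 0; i < k; i++)
                cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                            NB, NB, NB, -1.0f, TILE(j, i), N,
                            TILE(k, i), N, 1.0f, TILE(j, k), N);

            /* STRSM: C = C / T^T, triangular solve against the diagonal tile */
            cblas_strsm(CblasRowMajor, CblasRight, CblasLower, CblasTrans,
                        CblasNonUnit, NB, NB, 1.0f, TILE(k, k), N,
                        TILE(j, k), N);
        }
    }
    #undef TILE
    return 0;
}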

4.2.1 Computational Kernels

The implementation of the tile kernels assumes a fixed size of the tiles. Smaller tiles (finer granularity) have a positive effect on scheduling for parallel execution and facilitate better load balance and higher parallel efficiency. Bigger tiles provide better performance in sequential execution on a single SPE.

In the case of the CELL chip, the crossover point is rather simple to find for problems in dense linear algebra. From the standpoint of this work, the most important operation is matrix multiplication in single precision.


Operation          BLAS / LAPACK call
T ← T − A × A^T    cblas_ssyrk(CblasRowMajor, CblasLower, CblasNoTrans, 64, 64, -1.0, A, 64, 1.0, T, 64);
C ← C − B × A^T    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans, 64, 64, 64, -1.0, B, 64, A, 64, 1.0, C, 64);
B ← B × T^-T       cblas_strsm(CblasRowMajor, CblasRight, CblasLower, CblasTrans, CblasNonUnit, 64, 64, 1.0, T, 64, B, 64);
                   (B = B / T^T, using MATLAB notation)
T ← L × L^T        lapack_spotrf(lapack_lower, 64, trans(T), 64, &info);
                   (using the LAPACK C interface by R. Delmas and J. Langou, http://icl.cs.utk.edu/~delmas/lapwrapc.html)

Table 1: Single precision Cholesky tile kernels.

It turns out that this operation can achieve within a few percent of peak performance on a single SPE for matrices of size 64×64. The fact that peak performance can be achieved for a tile of such a small size has to be attributed to the large size of the register file and fast access to the local store, undisturbed by any intermediate levels of memory. Also, such a matrix occupies a 16 KB block of memory, which is the maximum size of a single DMA transfer. Eight such matrices fit in half of the local store, providing enough flexibility for multibuffering and, at the same time, leaving enough room for the code. A discussion of tile size considerations was also presented before by the authors of this publication [14]. Table 1 presents the Cholesky tile kernels for a tile size of 64×64 as BLAS and LAPACK calls.

It has already been pointed out that a few derivations of the Cholesky factorization are known, in particular the right-looking variant and the left-looking variant [12]. Dynamic scheduling of operations is another possibility. However, no matter which static variant is used, or whether dynamic execution is used, the same set of tile kernels is needed. The change from one to another only alters the order in which the tile operations execute.

All the kernels were developed using SIMD C language extensions. Extra effort was invested into optimizing the matrix multiplication (SGEMM) kernel performing the calculation C = C − B × A^T, since this operation dominates the execution time. (Tile kernels are referred to using the names of the BLAS and LAPACK routines implementing the same functionality.) All the tile kernels were written by consistently applying the idea of blocking with a block size of four. A short discussion of each case follows.

The SSYRK kernel applies a symmetric rank-k update to a tile. In other words, it performs the operation T ← T − A × A^T, where only the lower triangular part of T is modified. The SSYRK kernel consists of two nested loops, where in each inner iteration a 4×4 block of the output tile is produced and the body of the inner loop is completely unrolled. The #define and nested #define directives are used to create a single basic block - a block of straight-line code. Static offsets are used within the basic block, and pointer arithmetic is used to advance pointers between iterations.

Algorithm 2 SSYRK tile kernel T ← T − A × A^T.
1: for j = 0 to 15 do
2:     for i = 0 to j − 1 do
3:         Compute block [j,i]
4:         Permute/reduce block [j,i]
5:     end for
6:     Compute block [j,j]
7:     Permute/reduce block [j,j]
8: end for

block is a 4×4 submatrix of tile T.

The construction of the SSYRK tile kernel is presented by Algorithm 2. The unrolled code consists of four distinct segments. A computation segment (line 3) includes only multiply and add operations to calculate a product of two 4×64 blocks of tile A, and it produces 16 4-element vectors as output. A permutation/reduction segment (line 4) performs transpositions and reductions on these 16 vectors and delivers the four 4-element vectors constituting the 4×4 block of the output tile. The two segments mentioned above handle the off-diagonal, square blocks (lines 3 and 4), and two more segments handle the triangular, diagonal blocks (lines 6 and 7). It is an elegant and compact, but suboptimal, design.


Algorithm 3 SGEMM tile kernel C ← C − B × A^T.
1: Compute block 0
2: for i = 1 to 127 do
3:     Permute/reduce block 2i−2 & compute block 2i−1
4:     Permute/reduce block 2i−1 & compute block 2i
5: end for
6: Permute/reduce block 254 & compute block 255
7: Permute/reduce block 255

block is a 4×4 submatrix of tile C.


The SGEMM kernel performs the operation C ← C − B × A^T, which is very similar to the operation performed by the SSYRK kernel. One difference is that it updates the tile C with a product of two tiles, B and A^T, and not the tile A and its transpose. The second difference is that the output is the entire tile and not just its lower triangular portion. Also, since the SGEMM kernel is performance-critical, it is subject to more optimizations than the SSYRK kernel.

The main idea behind the additional optimization of the SGEMM kernel comes from the fact that the SPE is a dual-issue architecture, where arithmetic operations can execute in parallel with permutations of vector elements. Thanks to this fact, a pipelined execution can be implemented, where the operations of the permute/reduce segment from iteration k can be mixed with the operations of the compute segment from iteration k + 1. The two nested loops used for SSYRK are replaced with a single loop, where the 256 4×4 blocks of the output tile are produced in linear row-major order, which results in Algorithm 3.

Algorithm 4 STRSM tile kernel B ← B × T^-T.
1: for j = 0 to 15 do
2:     for i = 0 to j − 1 do
3:         Apply block i towards block j
4:     end for
5:     Solve block j
6: end for

block is a 64×4 submatrix of tile B.

The STRSM kernel computes a triangular solve with multiple right-hand sides, B ← B × T^-T. The computation is conceptually easiest to SIMD'ize if the same step is applied at the same time to different right-hand sides. This can be easily achieved if the memory layout of tile B is such that each 4-element vector contains elements of the same index from different right-hand sides. Since this is not the case here, each 4×4 block of the tile is transposed, in place, before and after the main loop implementing the solve. The operation introduces a minor overhead, but allows for a very efficient implementation of the solve - one which achieves a good fraction of the peak with small and simple code.

Algorithm 4 presents the structure of the code implementing the triangular solve, which is an application of the lazy variant of the algorithm. The choice of the lazy algorithm versus the aggressive algorithm is arbitrary. The code includes two segments of fully unrolled code, both of which operate on 64×4 blocks of data. The outer loop segment (line 5) produces a 64×4 block j of the solution. The inner loop segment (line 3) applies the update of step i to block j.

Algorithm 5 SPOTRF tile kernel T ← L × L^T.
1: for k = 0 to 15 do
2:     for i = 0 to k − 1 do
3:         SSYRK (apply block [k,i] to block [k,k])
4:     end for
5:     SPOTF2 (factorize block [k,k])
6:     for j = k + 1 to 15 do
7:         for i = 0 to k − 1 do
8:             SGEMM (apply block [j,i] to block [j,k])
9:         end for
10:    end for
11:    for j = k + 1 to 15 do
12:        STRSM (apply block [k,k] to block [j,k])
13:    end for
14: end for

block is a 4×4 submatrix of tile T.

The SPOTRF kernel computes the Cholesky factorization, T ← L × L^T, of a 64×64 tile. This is the lazy variant of the algorithm, more commonly referred to as the left-looking factorization, where updates to the trailing submatrix do not immediately follow the panel factorization. Instead, updates are applied to a panel right before the factorization of that panel.


Kernel | Source Code [LOC] | Compiler | Object Code [KB] | Execution Time [µs] | Execution Rate [Gflop/s] | Fraction of Peak [%]
SSYRK  | 160 | spuxlc(a) -O3  | 4.7 | 13.23 | 20.12 | 78
SGEMM  | 330 | spu-gcc(b) -Os | 9.0 | 22.78 | 23.01 | 90
STRSM  | 310 | spuxlc(a) -O3  | 8.2 | 16.47 | 15.91 | 62
SPOTRF | 340 | spu-gcc(b) -O3 | 4.1 | 15.16 |  5.84 | 23

(a) version 1.0 (SDK 1.1); (b) version 4.0.2 (toolchain 3.2, SDK 1.1)

Table 2: Performance of single precision Cholesky factorization tile kernels.


It could be expected that this kernel would be implemented using Level 2 BLAS operations, as this is the usual way of implementing panel factorizations. Such a solution would, however, lead to code that is difficult to SIMD'ize and would yield very poor performance. Instead, the implementation of this routine is simply an application of the idea of blocking with a block size equal to the SIMD vector size of four. Algorithm 5 presents the structure of the SPOTRF tile kernel.
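A scalar stand-in for the structure of Algorithm 5 is given below, for a row-major 64×64 tile; the real kernel carries out the same steps with SIMD vectors, and the unblocked 4×4 factorization plays the role of SPOTF2. For brevity, the sketch updates full 4×4 blocks with a single general multiply routine, whereas the actual SSYRK segment touches only the lower triangle of the diagonal blocks.

#include <math.h>

#define NB 64                                  /* tile size  */
#define BS  4                                  /* block size */

/* Unblocked Cholesky (SPOTF2 equivalent) of a 4x4 diagonal block. */
static int spotf2_block(float *T, int ld)
{
    for (int j = 0; j < BS; j++) {
        float d = T[j * ld + j];
        for (int k = 0; k < j; k++) d -= T[j * ld + k] * T[j * ld + k];
        if (d <= 0.0f) return -1;              /* not positive definite */
        d = sqrtf(d);
        T[j * ld + j] = d;
        for (int i = j + 1; i < BS; i++) {
            float s = T[i * ld + j];
            for (int k = 0; k < j; k++) s -= T[i * ld + k] * T[j * ld + k];
            T[i * ld + j] = s / d;
        }
    }
    return 0;
}

/* Cjk -= Cji * Cki^T for 4x4 blocks (covers the SSYRK and SGEMM steps). */
static void block_update(float *Cjk, const float *Cji, const float *Cki, int ld)
{
    for (int r = 0; r < BS; r++)
        for (int c = 0; c < BS; c++) {
            float s = 0.0f;
            for (int k = 0; k < BS; k++) s += Cji[r * ld + k] * Cki[c * ld + k];
            Cjk[r * ld + c] -= s;
        }
}

/* Cjk = Cjk * Ckk^-T: each row of Cjk is forward-substituted against the
 * lower triangular 4x4 factor in Ckk (the STRSM step). */
static void block_trsm(float *Cjk, const float *Ckk, int ld)
{
    for (int r = 0; r < BS; r++)
        for (int c = 0; c < BS; c++) {
            float s = Cjk[r * ld + c];
            for (int k = 0; k < c; k++) s -= Cjk[r * ld + k] * Ckk[c * ld + k];
            Cjk[r * ld + c] = s / Ckk[c * ld + c];
        }
}

/* Left-looking (lazy) Cholesky of one row-major NB x NB tile, blocked by 4. */
int spotrf_tile(float *T)
{
    #define BLK(i, j) (T + (i) * BS * NB + (j) * BS)
    for (int k = 0; k < NB / BS; k++) {
        for (int i = 0; i < k; i++)                       /* SSYRK steps  */
            block_update(BLK(k, k), BLK(k, i), BLK(k, i), NB);
        if (spotf2_block(BLK(k, k), NB) != 0) return -1;  /* SPOTF2 step  */
        for (int j = k + 1; j < NB / BS; j++) {
            for (int i = 0; i < k; i++)                   /* SGEMM steps  */
                block_update(BLK(j, k), BLK(j, i), BLK(k, i), NB);
            block_trsm(BLK(j, k), BLK(k, k), NB);         /* STRSM step   */
        }
    }
    #undef BLK
    return 0;
}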

Table 2 compares the tile kernels in terms of source and object code size and performance. Although performance is the most important metric, code size is not without meaning, due to the limited size of the local store. Despite the fact that code can be replaced in the local store at runtime, it is desirable that the entire code implementing the single precision factorization fits into the local store at the same time. Code motion during the factorization would both complicate the code and adversely affect performance.

Although the matrix multiplication achieves quite good performance - 23 Gflop/s, which is 90 percent of the peak - there is no doubt that better performance could be achieved by using assembly code instead of C language SIMD extensions. Performance in excess of 25 Gflop/s has been reported for similar, although not identical, SGEMM kernels [15]. It is remarkable that this performance can be achieved for operations of such small granularity, which has to be attributed to the unique features of the CELL architecture, especially the register file and memory organization.

It is worth noting that, although significant effort was required to optimize the SGEMM kernel (and yet more would be required to further optimize it), the other kernels involved a rather small programming effort in a higher level language to deliver satisfactory performance (execution time shorter than the execution time of the SGEMM kernel). This means that the Pareto principle, also known as the 80-20 rule (http://en.wikipedia.org/wiki/Pareto_principle), applies very well in this case. Only a small portion of the code requires strenuous optimization for the application to deliver close to peak performance.

4.2.2 Parallelization

The presentation of the parallelization of the Cholesky factorization needs to be preceded by a discussion of memory bandwidth considerations.

The SGEMM kernel can potentially execute at a rate very close to 25.6 Gflop/s on a single SPE. In such a case, it performs the 2 × 64³ = 524,288 operations in 20.48 µs. The operation requires the transmission of three tiles from main memory to the local store (tiles A, B and C) and the transmission of one tile from the local store to main memory (the updated tile C), consuming a bandwidth equal to:

    4 tiles × 64² elements × 4 bytes / 20.48 µs = 3.2 GB/s.

This means that eight SPEs performing the SGEMM operation at the same time require a bandwidth of 8 × 3.2 GB/s = 25.6 GB/s, which is the theoretical peak main memory bandwidth.


It has been shown that arithmetic can execute almost at the theoretical peak rate on the SPE. At the same time, it would not be realistic to expect theoretical peak bandwidth from the memory system. By the same token, data reuse has to be introduced into the algorithm to decrease the load on the main memory. A straightforward solution is to introduce 1D processing, where each SPE processes one row of tiles of the coefficient matrix at a time.

Please refer to Figure 1 for the following discussion. In the SSYRK part, one SPE applies a row of tiles A to the diagonal tile T, followed by the SPOTRF operation (factorization) of the tile T. Tile T only needs to be read in at the beginning of this step and written back at the end. The only transfers taking place in between are reads of tiles A. Similarly, in the SGEMM part, one SPE applies a row of tiles A and a row of tiles B to tile C, followed by the STRSM operation (triangular solve) on tile C using the diagonal, triangular tile T. Tile C only needs to be read in at the beginning of this step and written back at the end. Tile T only needs to be read in right before the triangular solve. The only transfers taking place in between are reads of tiles A and B. Such work partitioning approximately halves the load on the memory system.

It may also be noted that 1D processing is a natural match for the left-looking factorization. In the right-looking factorization, the update to the trailing submatrix can easily be partitioned in two dimensions. However, in the case of the left-looking factorization, 2D partitioning would not be feasible due to the write dependency on the panel blocks (tiles T and C).

1D partitioning introduces a load balancing problem. With work being distributed by block rows, in each step of the factorization a number of processors are going to be idle, equal to the number of block rows factorized in that step modulo the number of processors. Figure 2 shows three consecutive steps of a factorization, with the processors that are occupied and idle in these steps. Such behavior would put a harsh upper bound on achievable performance.

It can be observed, however, that at each step of the factorization, a substantial amount of work can be scheduled to the otherwise idle processors from the upcoming steps of the factorization.

Figure 2: Load imbalance caused by 1D processing.

The only operations that cannot be scheduled at a given point in time are those that involve panels that have not been factorized yet. This situation is illustrated in Figure 3. Of course, this kind of processing requires dependency tracking in two dimensions, but since all operations proceed at the granularity of tiles, this does not pose a problem.

Figure 3: Pipelining of factorization steps.

In the implementation presented here, all SPEs follow the static schedule presented in Figure 3, with cyclic distribution of work from the steps of the factorization. In this case, a static schedule works well, because the performance of the SPEs is very deterministic (unaffected by any non-deterministic phenomena, like cache misses). This way, the potential bottleneck of a centralized scheduling mechanism is avoided.

Figure 4 presents the execution chart (Gantt chart) of the factorization of a 1024×1024 matrix. Colors correspond to the ones in Figure 1. The two shades of green distinguish the SGEMM operation in odd and even steps of the factorization. The yellow color marks the barrier operation, which corresponds to the load imbalance of the algorithm.


Figure 4: Execution chart of the Cholesky factorization of a matrix of size 1024×1024. The color scheme follows the one from Figure 1. Different shades of green correspond to odd and even steps of the factorization.


It can be observed that load imbalance is minimal (the yellow region), dependency stalls are minimal (the white gaps), and the overlapping of communication and computation is very good (the colored blocks represent purely computational blocks).

4.2.3 Synchronization

With the SPEs following a static schedule, synchronization is required so that an SPE does not proceed if the data dependencies for an operation are not satisfied.

Several dependencies have to be tracked. The SSYRK and SGEMM operations cannot use as input tiles A and B that have not been factorized yet. The off-diagonal tiles A and B are factorized if the STRSM operation has completed on these tiles. In turn, the STRSM operation cannot use as input a diagonal tile T that has not been factorized yet. A diagonal tile T is factorized if the SPOTRF operation has completed on this tile.

Dependency tracking is implemented by means of a replicated progress table. The progress table is a 2D triangular array, with each entry representing a tile of the coefficient matrix and specifying whether the tile has been factorized or not, which means completion of a SPOTRF operation for diagonal tiles and of an STRSM operation for off-diagonal tiles.

By replication of the progress table, the potential bottleneck of a centralized progress tracking mechanism is avoided. Each SPE can check dependencies by testing an entry in its local copy of the progress table. The SPE completing the factorization of a tile updates the progress tables of all other SPEs, which is done by a DMA transfer and which introduces no overhead due to the non-blocking nature of these transfers. Since the smallest amount of data subject to a DMA transfer is one byte, the progress table entries are one byte in size. These transfers consume an insignificant amount of EIB bandwidth, and their latency is irrelevant to the algorithm.
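The mechanism can be sketched from one SPE's point of view using the standard MFC intrinsics from spu_mfcio.h. The square table layout, the variable names, and the array of peers' effective addresses below are illustrative assumptions, not the data structures of the actual code.

#include <spu_mfcio.h>
#include <stdint.h>

#define MAX_TILES 64        /* 4096 / 64: maximum matrix size in tiles */
#define NUM_SPES   8
#define SYNC_TAG  31        /* DMA tag reserved for progress-table updates */

/* Local replica of the progress table: progress[j][i] != 0 means tile (j,i)
 * is factorized (SPOTRF done for diagonal, STRSM done for off-diagonal).
 * A full square array is used here for simplicity; the paper uses a
 * triangular one. */
static volatile uint8_t progress[MAX_TILES][MAX_TILES]
    __attribute__((aligned(16)));

extern uint64_t peer_table_ea[NUM_SPES];   /* EAs of the other replicas */
extern int my_spe;

/* Busy-wait until a dependency is satisfied (peers set the entry via DMA). */
static inline void wait_for_tile(int j, int i)
{
    while (progress[j][i] == 0)
        ;
}

/* Mark tile (j,i) as factorized and broadcast the one-byte entry to all
 * other SPEs with non-blocking DMA puts. Because every replica shares the
 * same 16-byte-aligned layout, the low-order address bits of source and
 * destination match, as required for sub-16-byte transfers. */
static inline void publish_tile(int j, int i)
{
    progress[j][i] = 1;
    uint64_t off = (uint64_t)(j * MAX_TILES + i);
    for (int spe = 0; spe < NUM_SPES; spe++) {
        if (spe == my_spe) continue;
        mfc_put(&progress[j][i], peer_table_ea[spe] + off,
                1, SYNC_TAG, 0, 0);
    }
}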

4.2.4 Communication

The most important feature of the communication is double-buffering of data. With eight tile buffers available, each operation involved in the factorization can be double-buffered independently.

Thanks to this fact, double-buffering is implemented not only between operations of the same type, but also between different operations. In other words, data is always prefetched for the upcoming operation, no matter what operation it is. In the absence of dependency stalls, the SPEs never wait for data, which results in big portions of the execution chart without any gaps between computational blocks (Figure 5).
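The generic shape of this double-buffering scheme is sketched below for the input side only (the actual code also buffers the write-back of output tiles and prefetches across different operation types); tile_ea() and process_tile() are placeholder names.

#include <spu_mfcio.h>
#include <stdint.h>

#define TILE_BYTES (64 * 64 * sizeof(float))   /* one 16 KB tile */

static volatile float buf[2][64 * 64] __attribute__((aligned(128)));

extern uint64_t tile_ea(int step);             /* EA of the tile needed at 'step' */
extern void process_tile(volatile float *tile, int step);

void process_stream(int nsteps)
{
    int cur = 0;
    mfc_get(buf[cur], tile_ea(0), TILE_BYTES, cur, 0, 0);   /* prefetch first tile */

    for (int step = 0; step < nsteps; step++) {
        int nxt = cur ^ 1;
        if (step + 1 < nsteps)                              /* prefetch the next tile */
            mfc_get(buf[nxt], tile_ea(step + 1), TILE_BYTES, nxt, 0, 0);

        mfc_write_tag_mask(1 << cur);                       /* wait for the current tile */
        mfc_read_tag_status_all();

        process_tile(buf[cur], step);                       /* computation overlaps the DMA */
        cur = nxt;
    }
}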

Tiles are never exchanged internally between local stores, but always read from main memory and written to main memory. An attempt to do otherwise could tie up buffers in the local store and prevent asynchronous operation of the SPEs. At the same time, with the work partitioning implemented here, the memory system provides enough bandwidth to fully overlap communication and computation.

Reads of tiles involve dependency checks. When it comes to prefetching a tile, the dependency is tested and a DMA transfer is initiated if the dependency is satisfied. The DMA completion is tested right before processing of the tile.


Figure 5: Magnification of a portion of the execution chart from Figure 4.

If the dependency is not satisfied in time for the prefetch, the prefetch is abandoned in order not to stall the execution. Instead, right before processing of the tile, the SPE busy-waits for the dependency and then transfers the tile in a blocking way (it initiates the transfer and immediately waits for its completion).

4.2.5 Performance

Figure 6 shows the performance of the single precision Cholesky factorization, calculated as the number of floating point operations, taken as N³/3, divided by the execution time, where N is the size of the input matrix.

Table 3 gives numerical performance values for selected matrix sizes, in Gflop/s and also as ratios relative to the peak of the processor of 204.8 Gflop/s and the peak of the SGEMM kernel of 8 × 23.01 ≈ 184.1 Gflop/s.

The factorization achieves 95 percent of the peak of the SGEMM kernel, which means that the overheads of data communication, synchronization and load imbalance are minimal, and at this point the only inefficiency comes from the suboptimal performance of the SGEMM kernel. Hopefully, in the future the kernel will be fully optimized, perhaps using hand-coded assembly.

Figure 6: Performance of single precision Cholesky factorization (Gflop/s versus matrix size, with reference lines for the SP peak, the SGEMM peak and the DP peak).

Size | Gflop/s | % CELL Peak | % SGEMM Peak
 512 |      92 |          45 |           50
 640 |     113 |          55 |           62
1024 |     151 |          74 |           82
1280 |     160 |          78 |           87
1536 |     165 |          80 |           89
1664 |     166 |          81 |           90
2176 |     170 |          83 |           92
4096 |     175 |          85 |           95

Table 3: Selected performance points of single precision Cholesky factorization.

4.3 Refinement

The two most expensive operations of the refinement are the back solve (Algorithm 1, steps 3 and 8) and the residual calculation (Algorithm 1, step 6).

The back solve consists of two triangular solves involving the entire coefficient matrix and a single right-hand side (BLAS STRSV operation). The residual calculation is a double precision matrix-vector product using a symmetric matrix (BLAS DSYMV operation).

Both operations are Level 2 BLAS operations and on most processors would be strictly memory-bound.


The STRSV operation actually is a perfect example of a strictly memory-bound operation on the CELL processor. The DSYMV operation, however, is on the borderline of being memory bound versus compute bound, due to the very high speed of the memory system relative to the comparatively low performance of double precision arithmetic.

4.3.1 Triangular Solve

The triangular solve is a perfect example of a memory-bound workload, where the memory access rate sets the upper limit on achievable performance. The STRSV performs approximately two floating point operations per data element of four bytes, which means that the peak memory bandwidth of 25.6 GB/s allows for at most

    25.6 GB/s × 2 ops/float / 4 bytes/float = 12.8 Gflop/s,

which is only 1/16, or 6.25 percent, of the single precision floating point peak of the processor. Owing to this fact, the goal of implementing memory-bound operations is to get close to the peak memory bandwidth, unlike for compute-bound operations, where the goal is to get close to the floating point peak. This task should be readily achievable, given that a single SPE possesses as much bandwidth as the main memory.

Practice shows, however, that a single SPE is not capable of generating enough traffic to fully exploit the bandwidth, and a few SPEs solving the problem in parallel should be used. An efficient parallel implementation of the STRSV operation has to pursue two goals: continuous generation of traffic in order to saturate the memory system, and aggressive pursuit of the algorithmic critical path in order to avoid dependency stalls. A related question is the desired level of parallelism - the optimal number of processing elements used. Since the triangular solve is rich in dependencies, increasing the number of SPEs increases the number of execution stalls caused by interprocessor dependencies. Obviously, there is a crossover point, a sweet spot, for the number of SPEs used for the operation.

Like the other routines in the code, the STRSV operation processes the coefficient matrix by 64×64 tiles. A triangular solve is performed on the diagonal tiles and a matrix-vector product (SGEMV equivalent) is performed on the off-diagonal tiles. Processing of the diagonal tiles constitutes the critical path in the algorithm. One SPE is solely devoted to processing of the diagonal tiles, while the goal of the others is to saturate the memory system with processing of the off-diagonal tiles. The number of SPEs used to process the off-diagonal tiles is a function of a few parameters, one of which is the efficiency of the computational kernels used. In this case, four SPEs turned out to deliver the best results, with one SPE pursuing the critical path and three others fulfilling the task of memory saturation. Figure 7 presents the distribution of work in the triangular solve routines.
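The tile-level structure described above (a triangular solve on diagonal tiles, an SGEMV-type update on off-diagonal tiles) can be sketched sequentially with reference CBLAS calls as follows, for the forward substitution Ly = b performed in place; the parallel distribution of Figure 7 and the local-store communication are omitted.

#include <cblas.h>

/* x <- L \ x for a lower triangular, row-major L of size N, processed by
 * NB x NB tiles (N is assumed to be a multiple of NB). */
void tiled_strsv_lower(int N, int NB, const float *L, float *x)
{
    int nt = N / NB;
    for (int j = 0; j < nt; j++) {
        for (int i = 0; i < j; i++) {
            const float *Lji = L + (size_t)j * NB * N + (size_t)i * NB;
            /* x_j -= L_ji * x_i: matrix-vector product on an off-diagonal tile */
            cblas_sgemv(CblasRowMajor, CblasNoTrans, NB, NB, -1.0f, Lji, N,
                        x + i * NB, 1, 1.0f, x + j * NB, 1);
        }
        const float *Ljj = L + (size_t)j * NB * N + (size_t)j * NB;
        /* x_j <- L_jj^{-1} x_j: triangular solve on the diagonal tile */
        cblas_strsv(CblasRowMajor, CblasLower, CblasNoTrans, CblasNonUnit,
                    NB, Ljj, N, x + j * NB, 1);
    }
}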

Figure 7: Distribution of work in the triangular solve routine.

The solve is done in place. The unknown/solution vector is read in its entirety by each SPE into its local store at the beginning and returned to main memory at the end. As the computation proceeds, updated pieces of the vector are exchanged internally by means of direct, local store to local store communication. At the end, SPE 0 possesses the full solution vector and writes it back to main memory. Synchronization is implemented analogously to the synchronization in the factorization and is based on the replicated triangular progress table (the same data structure is reused).

Figure 8 shows the performance, in terms of GB/s, of the two triangular solve routines required in the solution/refinement step of the algorithm. The two routines perform slightly differently due to the different performance of their computational kernels.


Figure 8: Performance of the triangular solve routines (GB/s versus matrix size, with a reference line for the memory peak).

The figure shows clearly that the goal of saturating the memory system is achieved quite well. Performance as high as 23.77 GB/s is obtained, which is 93 percent of the peak.

4.3.2 Matrix-Vector Product

For most hardware platforms the matrix-vector product would be a memory-bound operation, the same as the triangular solve. On the CELL processor, however, due to the relative slowness of the double precision arithmetic, the operation is on the border of being memory-bound and compute-bound. Even with the use of all eight SPEs, saturation of the memory is harder than in the case of the STRSV routine.

The DSYMV routine also operates on tiles. Here, however, the double precision representation of the coefficient matrix is used, with a tile size of 32×32, such that an entire tile can still be transferred with a single DMA request. The input vector is read in only once, at the beginning, in its entirety, and the output vector is written back after the computation is completed. Since the input matrix is symmetric, only the lower portion is accessed, and implicit transposition is used to reflect the effect of the upper portion - each tile is read in only once, but applied to the output vector twice (with and without transposition).

Since load balance is a very important aspect of the implementation, work is split among the SPEs very evenly by applying the partitioning shown in Figure 9. Such a work distribution causes multiple write dependencies on the output vector and, in order to let each SPE proceed without stalls, the output vector is replicated on each SPE and the multiplication is followed by a reduction step. The reduction is also performed in parallel by all SPEs and introduces a very small overhead, compensated by the benefits of very good load balance.
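The tile traversal with implicit transposition described above can be sketched with reference CBLAS calls as follows. The sketch is sequential and computes y <- y + A x from the lower triangle only; the per-SPE replication of the output vector and the final parallel reduction are omitted.

#include <cblas.h>

/* y <- y + A * x for a symmetric, row-major A of size N with only the lower
 * triangle accessed, processed by NB x NB tiles (N % NB == 0). Each
 * off-diagonal tile is read once and applied twice: as-is to y_j and
 * transposed to y_i. */
void tiled_dsymv_lower(int N, int NB, const double *A,
                       const double *x, double *y)
{
    int nt = N / NB;
    for (int j = 0; j < nt; j++) {
        const double *Ajj = A + (size_t)j * NB * N + (size_t)j * NB;
        /* diagonal tile: symmetric update */
        cblas_dsymv(CblasRowMajor, CblasLower, NB, 1.0, Ajj, N,
                    x + j * NB, 1, 1.0, y + j * NB, 1);
        for (int i = 0; i < j; i++) {
            const double *Aji = A + (size_t)j * NB * N + (size_t)i * NB;
            /* y_j += A_ji * x_i */
            cblas_dgemv(CblasRowMajor, CblasNoTrans, NB, NB, 1.0, Aji, N,
                        x + i * NB, 1, 1.0, y + j * NB, 1);
            /* y_i += A_ji^T * x_j: the implicit use of the upper triangle */
            cblas_dgemv(CblasRowMajor, CblasTrans, NB, NB, 1.0, Aji, N,
                        x + j * NB, 1, 1.0, y + i * NB, 1);
        }
    }
}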

Figure 9: Distribution of work in the matrix-vector product routine.

Figure 10 shows the performance, in terms of GB/s, of the double precision, symmetric matrix-vector product routine. Performance of 20.93 GB/s is achieved, which is 82 percent of the peak. Although the DSYMV routine represents a Level 2 BLAS operation and is parallelized among all SPEs, it is still compute bound. Perhaps its computational components could be further optimized. Nevertheless, at this point the delivered level of performance is considered satisfactory.

5 Limitations

The implementation presented here should be considered a proof-of-concept prototype with the purpose of establishing an upper bound on the performance achievable for mixed-precision, dense linear algebra algorithms on the CELL processor.


Figure 10: Performance of the matrix-vector product routine (GB/s versus matrix size, with a reference line for the memory peak).

As such, it has a number of limitations. Only systems that are multiples of 64 in size are accepted, and only systems with a single right-hand side are supported. There are no tests for overflow during the conversion from double to single precision. There is no test for non-definiteness during the single precision factorization step. The current implementation is wasteful in its use of main memory: the entire coefficient matrix is stored explicitly, without taking advantage of its symmetry, in both the single precision and the double precision representation. The maximum size of the coefficient matrix is set to 4096, which makes it possible to fit the progress table in each local store. This also makes it possible to fit the entire unknown/solution vector in each local store, which facilitates internal, local store to local store communication and is very beneficial for performance.

6 Results and Discussion

Figure 11 compares the performance of a single precision factorization (SPOTRF), solution of the system in single precision (SPOSV), and solution of the system in double precision by using factorization in single precision and iterative refinement to double precision (DSPOSV).

Figure 11: Performance of the mixed-precision algorithm versus the single precision algorithm on an IBM CELL blade system (using one CELL processor): SPOTRF, SPOSV and DSPOSV, with reference lines for the SP peak, the SGEMM peak and the DP peak.

These results were obtained on an IBM CELL blade using one of the two available CELL processors. Huge memory pages (16 MB) were used for improved performance. The performance is calculated as the number of floating point operations, set in all cases to N³/3, divided by the execution time. In all cases well conditioned input matrices were used, resulting in two steps of refinement delivering accuracy equal to or higher than that delivered by the purely double precision algorithm.

At the maximum size of 4096, the factorization achieves 175 Gflop/s and the single precision system solution runs at a relative speed of 171 Gflop/s. At the same time, the solution in double precision using the refinement technique delivers a relative performance of 156 Gflop/s, which is an overhead of less than 9 percent compared to the solution of the system in single precision. It can also be pointed out that, by using the mixed-precision algorithm, double precision results are delivered at a speed over 10 times greater than the double precision peak of the processor.


Figure 12: Performance of the mixed-precision algorithm versus the single precision algorithm on the Sony PlayStation 3: SPOTRF, SPOSV and DSPOSV, with reference lines for the SP peak, the SGEMM peak and the DP peak.

Figure 12 shows results obtained on the Sony PlayStation 3, using the six available SPEs and the 256 MB of memory (of which only approximately 200 MB is available to the application), allocated using huge pages (16 MB). For the maximum problem size of 2048, a performance of 127 Gflop/s was achieved for the factorization, 122 Gflop/s for the single precision solution, and 104 Gflop/s for the double precision solution. In this case, the double precision solution comes at the cost of roughly 15 percent overhead compared to the single precision solution.

7 Conclusions

The CELL Broadband Engine has a very high potential for dense linear algebra workloads, offering a very high peak floating point performance and the capability to deliver performance close to the peak for quite small problems. The same applies to the memory system of the CELL processor, which allows for data transfer rates very close to the peak bandwidth for memory-intensive workloads.

Although the double precision performance of the CELL processor is much lower than the single precision performance, mixed-precision algorithms permit exploiting the single precision speed while achieving full double precision accuracy.

8 Future Plans

One of the main considerations for the future is the application of the pipelined processing techniques to factorizations where the panel does not easily split into independent operations, such as factorizations where pivoting is used.

Another important question is that of replacing the static scheduling of operations with dynamic scheduling by an automated system, and also the impact of such mechanisms on programming ease and productivity.

9 Acknowledgements

The authors thank Gary Rancourt and Kirk Jordan at IBM for taking care of our hardware needs and arranging for partial financial support for this work. The authors are thankful to numerous IBM researchers for generously sharing their CELL expertise, in particular Sidney Manning, Daniel Brokenshire, Mike Kistler, Gordon Fossum, Thomas Chen and Michael Perrone.

10 Code

The code is publicly available at http://icl.cs.utk.edu/iter-ref/ (follow the link "CELL BE Code").

References

[1] H. P. Hofstee. Power efficient processor architecture and the Cell processor. In Proceedings of the 11th Int'l Symposium on High-Performance Computer Architecture, 2005.

[2] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. Introduction to the Cell multiprocessor. IBM J. Res. & Dev., 49(4/5):589–604, 2005.

[3] IBM. Cell Broadband Engine Architecture, Version 1.0, August 2005.

[4] J. H. Wilkinson. Rounding Errors in Algebraic Processes. Prentice-Hall, 1963.

[5] C. B. Moler. Iterative refinement in floating point. J. ACM, 14(2):316–321, 1967.

[6] G. W. Stewart. Introduction to Matrix Computations. Academic Press, 1973.

[7] N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, 1996.

[8] J. Langou, J. Langou, P. Luszczek, J. Kurzak, A. Buttari, and J. J. Dongarra. Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, 2006.

[9] IBM. Cell Broadband Engine Programming Handbook, Version 1.0, April 2006.

[10] IBM. Cell Broadband Engine Programming Tutorial, Version 2.0, December 2006.

[11] A. Buttari, P. Luszczek, J. Kurzak, J. J. Dongarra, and G. Bosilca. A rough guide to scientific computing on the PlayStation 3, version 1.0. Technical Report UT-CS-07-595, Computer Science Department, University of Tennessee, 2007. http://www.cs.utk.edu/library/TechReports/2007/ut-cs-07-595.pdf.

[12] J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst. Numerical Linear Algebra for High-Performance Computers. SIAM, 1998.

[13] J. Kurzak and J. J. Dongarra. Implementing linear algebra routines on multi-core processors with pipelining and a look-ahead. In Proceedings of the 2006 Workshop on State-of-the-Art in Scientific and Parallel Computing (PARA'06), Umeå, Sweden, 2006. Lecture Notes in Computer Science, Springer, 2007 (to appear).

[14] J. Kurzak and J. J. Dongarra. Implementation of mixed precision in solving systems of linear equations on the CELL processor. Concurrency Computat.: Pract. Exper., 2007. In press, DOI: 10.1002/cpe.1164.

[15] T. Chen, R. Raghavan, J. Dale, and E. Iwata. Cell Broadband Engine architecture and its first implementation - a performance view. http://www-128.ibm.com/developerworks/power/library/pa-cellperf/, November 2005.
