High-Performance Grid Computing and Research Networking


High-Performance Grid Computing and Research Networking
Algorithms on a Grid of Processors

Presented by Xing Hang
Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/~sadjadi/Teaching/
sadjadi At cs Dot fiu Dot edu

Acknowledgements
The content of many of the slides in these lecture notes has been adopted from the online resources prepared previously by the people listed below. Many thanks!
Henri Casanova
Principles of High Performance Computing
http://navet.ics.hawaii.edu/
[email protected]

2-D Torus topology
We've looked at a ring, but for some applications it's convenient to look at a 2-D grid topology.
A 2-D grid with wrap-around is called a 2-D torus.
Advanced parallel linear algebra libraries/languages allow combining arbitrary data distribution strategies with arbitrary topologies (ScaLAPACK, HPF):
1-D block onto a ring
2-D block onto a 2-D grid
cyclic or non-cyclic (more on this later)
We can go through all the algorithms we saw on a ring and make them work on a grid.
In practice, for many linear algebra kernels, using a 2-D block-cyclic distribution on a 2-D grid seems to work best in most situations:
we've seen that blocks are good for locality
we've seen that cyclic is good for load-balancing
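On a torus, wrap-around neighbors are just modular index arithmetic; here is a tiny illustrative Python helper (not from the slides) for a q x q torus:

def torus_neighbors(i, j, q):
    """North, south, west, and east neighbors of processor (i, j) on a q x q torus."""
    return {
        "north": ((i - 1) % q, j),
        "south": ((i + 1) % q, j),
        "west":  (i, (j - 1) % q),
        "east":  (i, (j + 1) % q),
    }

# Wrap-around: the north neighbor of (0, 0) on a 4 x 4 torus is (3, 0)
assert torus_neighbors(0, 0, 4)["north"] == (3, 0)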

Semantics of a parallel linear algebra routine?
Centralized
when calling a function (e.g., LU), the input data is available on a single master machine
the input data must then be distributed among the workers
the output data must be undistributed and returned to the master machine
More natural/easy for the user
Allows the library to make data distribution decisions transparently to the user
Prohibitively expensive if one does sequences of operations
and one almost always does so
Distributed
when calling a function (e.g., LU)
assume that the input is already distributed
leave the output distributed
May lead to having to redistribute data in between calls so that distributions match, which is harder for the user and may be costly as well
For instance, one may want to change the block size between calls, or go from a non-cyclic to a cyclic distribution
Most current software adopts the distributed approach
more work for the user
more flexibility and control

Matrix-matrix multiply
Many people have thought about doing a matrix multiply on a 2-D torus.
Assume that we have three matrices A, B, and C, of size NxN.
Assume that we have p processors, so that p = q^2 is a perfect square and our processor grid is q x q.
We're looking at a 2-D block distribution, but not cyclic
again, that would obfuscate the code too much
We're going to look at three classic algorithms: Cannon, Fox, Snyder

A blocks:            C blocks:            B blocks:
A00 A01 A02 A03      C00 C01 C02 C03      B00 B01 B02 B03
A10 A11 A12 A13      C10 C11 C12 C13      B10 B11 B12 B13
A20 A21 A22 A23      C20 C21 C22 C23      B20 B21 B22 B23
A30 A31 A32 A33      C30 C31 C32 C33      B30 B31 B32 B33

Cannon's Algorithm (1969)
Very simple (comes from systolic arrays)
Starts with a data redistribution for matrices A and B
the goal is to have only neighbor-to-neighbor communications
A is circularly shifted/rotated horizontally so that its diagonal is on the first column of processors
B is circularly shifted/rotated vertically so that its diagonal is on the first row of processors
This is called preskewing

A after preskewing (row i rotated left by i):
A00 A01 A02 A03
A11 A12 A13 A10
A22 A23 A20 A21
A33 A30 A31 A32

C (unchanged):
C00 C01 C02 C03
C10 C11 C12 C13
C20 C21 C22 C23
C30 C31 C32 C33

B after preskewing (column j rotated up by j):
B00 B11 B22 B33
B10 B21 B32 B03
B20 B31 B02 B13
B30 B01 B12 B23

Cannon's Algorithm

Preskewing of A and B
for k = 1 to q in parallel
    Local C = C + A*B
    Vertical shift of B
    Horizontal shift of A
Postskewing of A and B

    Of course, computation and communication could be done in an overlapped fashion locally at each processor
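To make the data movement concrete, here is a minimal serial sketch in Python/NumPy, not from the original slides, that simulates Cannon's algorithm on a q x q grid of blocks; the preskew, the local multiplies, and the shifts operate on a q x q array of blocks standing in for the processors.

import numpy as np

def cannon_multiply(A, B, q):
    """Simulate Cannon's algorithm with a q x q grid of blocks (q must divide N)."""
    N = A.shape[0]
    b = N // q  # block size
    # Cut the matrices into q x q grids of b x b blocks
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Cb = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]
    # Preskewing: rotate row i of A left by i, and column j of B up by j
    Ab = [[Ab[i][(j + i) % q] for j in range(q)] for i in range(q)]
    Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]
    for _ in range(q):
        # Local computation on every "processor" (i, j)
        for i in range(q):
            for j in range(q):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]
        # Horizontal shift of A (left by one) and vertical shift of B (up by one)
        Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return np.block(Cb)

# Sanity check against the direct product
q, N = 4, 8
A, B = np.random.rand(N, N), np.random.rand(N, N)
assert np.allclose(cannon_multiply(A, B, q), A @ B)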


Execution Steps...
[Figure: block layouts after preskewing; local computation on proc (0,0); horizontal shift of A and vertical shift of B; local computation on proc (0,0) with the shifted blocks]

Fox's Algorithm (1987)
Originally developed for Caltech's Hypercube
Uses broadcasts and is also called the broadcast-multiply-roll algorithm
broadcasts the diagonals of matrix A
Uses a shift of matrix B
No preskewing step
[Figure: the first, second, third, ... diagonals of A]

Execution Steps...
[Figure: initial state; broadcast of A's 1st diagonal along the processor rows; local computation]

Execution Steps...
[Figure: vertical shift of B; broadcast of A's 2nd diagonal along the processor rows; local computation]

Fox's Algorithm

// No initial data movement
for k = 1 to q in parallel
    Broadcast A's kth diagonal
    Local C = C + A*B
    Vertical shift of B
// No final data movement

Note that there is an additional array to store the incoming diagonal block.
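As with Cannon, here is a minimal serial Python/NumPy sketch, not from the slides, that simulates Fox's broadcast-multiply-roll on a q x q block grid; the broadcast is modeled by every "processor" in row i using the same diagonal block of A at each step.

import numpy as np

def fox_multiply(A, B, q):
    """Simulate Fox's algorithm (broadcast-multiply-roll) on a q x q block grid."""
    N = A.shape[0]
    b = N // q  # block size
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Cb = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]
    for k in range(q):
        # "Broadcast" the kth diagonal of A: every processor in row i uses block A[i][(i+k) % q]
        for i in range(q):
            diag = Ab[i][(i + k) % q]
            for j in range(q):
                Cb[i][j] += diag @ Bb[i][j]
        # Vertical shift (roll) of B: every block moves up by one processor row
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return np.block(Cb)

q, N = 4, 8
A, B = np.random.rand(N, N), np.random.rand(N, N)
assert np.allclose(fox_multiply(A, B, q), A @ B)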


Snyder's Algorithm (1992)
More complex than Cannon's or Fox's
First transposes matrix B
Uses reduction operations (sums) on the rows of matrix C
Shifts matrix B
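A minimal serial Python/NumPy sketch, not from the slides, of the transpose / local-multiply / row-reduction / shift pattern; at step k the row sums produce the blocks C[i][(i+k) mod q].

import numpy as np

def snyder_multiply(A, B, q):
    """Simulate Snyder's algorithm on a q x q block grid."""
    N = A.shape[0]
    b = N // q  # block size
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Cb = [[None] * q for _ in range(q)]
    # Transpose B at the block level: "processor" (i, j) now holds block B[j][i]
    Tb = [[Bb[j][i] for j in range(q)] for i in range(q)]
    for k in range(q):
        for i in range(q):
            # Local multiply on every processor of row i, then global sum over the row:
            # sum_j A[i][j] @ B[j][(i+k) % q] = C[i][(i+k) % q]
            Cb[i][(i + k) % q] = sum(Ab[i][j] @ Tb[i][j] for j in range(q))
        # Vertical shift of the transposed B blocks: up by one processor row
        Tb = [[Tb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return np.block(Cb)

q, N = 4, 8
A, B = np.random.rand(N, N), np.random.rand(N, N)
assert np.allclose(snyder_multiply(A, B, q), A @ B)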

Execution Steps...
[Figure: initial state; transpose of B; local computation]

Execution Steps...
[Figure: shift of B; global sum on the rows of C; local computation]

Execution Steps...
[Figure: another shift of B; global sum on the rows of C; local computation]

Complexity Analysis
Very cumbersome
Two models:
4-port model: every processor can communicate with its 4 neighbors in one step
can match underlying architectures like the Intel Paragon
1-port model: only one single communication at a time for each processor
Both models assume bi-directional communication

One-port results
[Running-time expressions for Cannon, Fox, and Snyder under the one-port model]

Complexity Results
m in these expressions is the block size
Expressions for the 4-port model are MUCH more complicated
Remember that this is all for non-cyclic distributions
formulae and code become very complicated for a full-fledged implementation (nothing divides anything, nothing's a perfect square, etc.)
Performance analysis of real code is known to be hard
It is done in a few restricted cases
An interesting approach is to use simulation
Done in ScaLAPACK (Scalable Linear Algebra PACKage), for instance
Essentially: you have written a code so complex that you just run a simulation of it to figure out how fast it goes in different cases

So What?
Are we stuck with these rather cumbersome algorithms?
Fortunately, there is a much simpler algorithm that is
not as clever
about as good in practice anyway
That's the one you'll implement in your programming assignment

The Outer-Product Algorithm
Remember the sequential matrix multiply:

for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      Cij = Cij + Aik * Bkj

The first two loops are completely parallel, but the third one isn't
i.e., in shared memory, it would require a mutex to protect the writing of the shared variable Cij
One solution: view the algorithm as n sequential steps:

for k = 1 to n        // done in sequence
  for i = 1 to n      // done in parallel
    for j = 1 to n    // done in parallel
      Cij = Cij + Aik * Bkj
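The "outer-product" name comes from the fact that step k adds the outer product of A's kth column and B's kth row to C; a short NumPy illustration (not from the slides):

import numpy as np

n = 4
A, B = np.random.rand(n, n), np.random.rand(n, n)
C = np.zeros((n, n))
for k in range(n):
    # Step k adds the outer product of A's kth column and B's kth row
    C += np.outer(A[:, k], B[k, :])
assert np.allclose(C, A @ B)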

The Outer-Product Algorithm

for k = 1 to n        // done in sequence
  for i = 1 to n      // done in parallel
    for j = 1 to n    // done in parallel
      Cij = Cij + Aik * Bkj

During the kth step, the processor that owns Cij needs Aik and Bkj
Therefore, at the kth step, the kth column of A and the kth row of B must be broadcast over all processors
Let us assume a 2-D block distribution

2-D Block distribution
At each step, p - q processors receive a piece of the kth column of A and p - q processors receive a piece of the kth row of B (p = q^2 processors)

[Figure: 2-D block layouts of A, C, and B on the q x q processor grid, with the kth block column of A and the kth block row of B highlighted]

Outer-Product Algorithm
Once everybody has received its piece of the kth column of A and the kth row of B, everybody can add to the Cij's they are responsible for
And this is repeated n times
In your programming assignment:
Implement the outer-product algorithm
Do the theoretical performance analysis
with assumptions similar to the ones we have used in class so far
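A possible shape for such an implementation, sketched with mpi4py under several assumptions (p is a perfect square, the global size N is divisible by q, and local blocks are generated in place rather than read from input); this is an illustration, not the assignment solution.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
p = comm.Get_size()
q = int(round(p ** 0.5))            # processor grid is q x q (p assumed to be a perfect square)
my_row, my_col = divmod(comm.Get_rank(), q)

N = 8                               # global matrix size, assumed divisible by q
b = N // q                          # block size: each rank owns one b x b block of A, B, C

rng = np.random.default_rng(comm.Get_rank())
A_loc = rng.random((b, b))          # this rank's block of A
B_loc = rng.random((b, b))          # this rank's block of B
C_loc = np.zeros((b, b))

# Communicators along the processor rows and columns of the grid
row_comm = comm.Split(color=my_row, key=my_col)   # ranks in the same processor row
col_comm = comm.Split(color=my_col, key=my_row)   # ranks in the same processor column

for k in range(N):
    # The processor column owning global column k of A broadcasts its piece along each row
    owner_col = k // b
    a_piece = A_loc[:, k % b].copy() if my_col == owner_col else np.empty(b)
    row_comm.Bcast(a_piece, root=owner_col)
    # The processor row owning global row k of B broadcasts its piece along each column
    owner_row = k // b
    b_piece = B_loc[k % b, :].copy() if my_row == owner_row else np.empty(b)
    col_comm.Bcast(b_piece, root=owner_row)
    # Rank-1 update of the local block of C
    C_loc += np.outer(a_piece, b_piece)

The optimizations on the next slide apply directly to this sketch: broadcasting several columns/rows at a time instead of one, and overlapping the broadcasts with the rank-1 updates.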

Further Optimizations
Send blocks of rows/columns to avoid too many small transfers
What is the optimal granularity?
Overlap communication and computation by using asynchronous communication
How much can be gained?

    This is a simple and effective algorithm that is not too cumbersome


Cyclic 2-D distributions
What if I want to run on 6 processors?
It's not a perfect square
In practice, one makes the distribution cyclic to accommodate various numbers of processors
How do we do this in 2-D?
i.e., how do we do a 2-D block cyclic distribution?

The 2-D block cyclic distribution
Goal: try to have all the advantages of both the horizontal and the vertical 1-D block cyclic distributions
Works whichever way the computation progresses
left-to-right, top-to-bottom, wavefront, etc.
Consider a number of processors p = r * c, arranged in an r x c grid
Consider a 2-D matrix of size NxN
Consider a block size b (which divides N)

The 2-D block cyclic distribution
[Figure: an N x N matrix partitioned into b x b blocks, to be distributed over 6 processors P0..P5 arranged in an r x c processor grid]

The 2-D block cyclic distribution
[Figure: the same matrix with the top-left b x b blocks assigned to P0..P5, one block per processor]

The 2-D block cyclic distribution
The assignment of blocks to processors repeats cyclically in both dimensions; with a 2 x 3 processor grid the tiling is

P0 P1 P2 P0 P1 P2 ...
P3 P4 P5 P3 P4 P5 ...
P0 P1 P2 P0 P1 P2 ...
P3 P4 P5 P3 P4 P5 ...

Slight load imbalance
becomes negligible with many blocks
Index computations had better be implemented in separate functions
Also: functions that tell a process who its neighbors are
Overall, this requires a whole infrastructure, but many think you can't go wrong with this distribution
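Those "index computations in separate functions" boil down to a couple of small mapping helpers; a minimal Python sketch (function names are illustrative, not from the slides) for a block size b and an r x c processor grid:

def owner(i, j, b, r, c):
    """Processor (grid row, grid col) that owns global matrix entry (i, j)."""
    block_row, block_col = i // b, j // b
    return (block_row % r, block_col % c)

def local_index(i, j, b, r, c):
    """Position of global entry (i, j) inside the owning processor's local array."""
    li = (i // (b * r)) * b + i % b
    lj = (j // (b * c)) * b + j % b
    return (li, lj)

# Example with b = 2 and a 2 x 3 processor grid:
# global entry (5, 7) lies in block (2, 3), which is owned by processor (2 % 2, 3 % 3) = (0, 0)
assert owner(5, 7, b=2, r=2, c=3) == (0, 0)
assert local_index(5, 7, b=2, r=2, c=3) == (3, 3)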

Notes
By analogy with the regular pumping of blood by the heart, a systolic array is an arrangement of processors in an array (often rectangular) where data flows synchronously across the array between neighbours, usually with different data flowing in different directions.