High-Performance Grid Computing and Research Networking


High-Performance Grid Computing and Research Networking
Algorithms on a Grid of Processors

Presented by Xing Hang
Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/~sadjadi/Teaching/
sadjadi At cs Dot fiu Dot edu

Acknowledgements
The content of many of the slides in these lecture notes has been adopted from the online resources prepared previously by the people listed below. Many thanks!
Henri Casanova
Principles of High Performance Computing
http://navet.ics.hawaii.edu/
[email protected]

2-D Torus topology
We've looked at a ring, but for some applications it's convenient to look at a 2-D grid topology.
A 2-D grid with wrap-around is called a 2-D torus.
Advanced parallel linear algebra libraries/languages allow combining arbitrary data distribution strategies with arbitrary topologies (ScaLAPACK, HPF):
1-D block onto a ring
2-D block onto a 2-D grid
cyclic or non-cyclic (more on this later)
We can go through all the algorithms we saw on a ring and make them work on a grid.
In practice, for many linear algebra kernels, using a 2-D block-cyclic distribution on a 2-D grid seems to work best in most situations:
we've seen that blocks are good for locality
we've seen that cyclic is good for load-balancing
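On a torus, wrap-around neighbors are just modular index arithmetic; here is a tiny illustrative Python helper (not from the slides) for a q x q torus:

def torus_neighbors(i, j, q):
    """North, south, west, and east neighbors of processor (i, j) on a q x q torus."""
    return {
        "north": ((i - 1) % q, j),
        "south": ((i + 1) % q, j),
        "west":  (i, (j - 1) % q),
        "east":  (i, (j + 1) % q),
    }

# Wrap-around: the north neighbor of (0, 0) on a 4 x 4 torus is (3, 0)
assert torus_neighbors(0, 0, 4)["north"] == (3, 0)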

Semantics of a parallel linear algebra routine?
Centralized
when calling a function (e.g., LU), the input data is available on a single master machine
the input data must then be distributed among the workers
the output data must be undistributed and returned to the master machine
More natural/easy for the user
Allows the library to make data distribution decisions transparently to the user
Prohibitively expensive if one does sequences of operations
and one almost always does so
Distributed
when calling a function (e.g., LU)
assume that the input is already distributed
leave the output distributed
May lead to having to redistribute data in between calls so that distributions match, which is harder for the user and may be costly as well
For instance, one may want to change the block size between calls, or go from a non-cyclic to a cyclic distribution
Most current software adopts the distributed approach
more work for the user
more flexibility and control

Matrix-matrix multiply
Many people have thought about doing a matrix multiply on a 2-D torus.
Assume that we have three matrices A, B, and C, of size NxN.
Assume that we have p processors, so that p = q^2 is a perfect square and our processor grid is q x q.
We're looking at a 2-D block distribution, but not cyclic
again, that would obfuscate the code too much
We're going to look at three classic algorithms: Cannon, Fox, Snyder

A blocks:            C blocks:            B blocks:
A00 A01 A02 A03      C00 C01 C02 C03      B00 B01 B02 B03
A10 A11 A12 A13      C10 C11 C12 C13      B10 B11 B12 B13
A20 A21 A22 A23      C20 C21 C22 C23      B20 B21 B22 B23
A30 A31 A32 A33      C30 C31 C32 C33      B30 B31 B32 B33

Cannon's Algorithm (1969)
Very simple (comes from systolic arrays)
Starts with a data redistribution for matrices A and B
the goal is to have only neighbor-to-neighbor communications
A is circularly shifted/rotated horizontally so that its diagonal is on the first column of processors
B is circularly shifted/rotated vertically so that its diagonal is on the first row of processors
This is called preskewing

A after preskewing (row i rotated left by i):
A00 A01 A02 A03
A11 A12 A13 A10
A22 A23 A20 A21
A33 A30 A31 A32

C (unchanged):
C00 C01 C02 C03
C10 C11 C12 C13
C20 C21 C22 C23
C30 C31 C32 C33

B after preskewing (column j rotated up by j):
B00 B11 B22 B33
B10 B21 B32 B03
B20 B31 B02 B13
B30 B01 B12 B23

Cannon's Algorithm

Preskewing of A and B
for k = 1 to q in parallel
    Local C = C + A*B
    Vertical shift of B
    Horizontal shift of A
Postskewing of A and B

    Of course, computation and communication could be done in an overlapped fashion locally at each processor
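To make the data movement concrete, here is a minimal serial sketch in Python/NumPy, not from the original slides, that simulates Cannon's algorithm on a q x q grid of blocks; the preskew, the local multiplies, and the shifts operate on a q x q array of blocks standing in for the processors.

import numpy as np

def cannon_multiply(A, B, q):
    """Simulate Cannon's algorithm with a q x q grid of blocks (q must divide N)."""
    N = A.shape[0]
    b = N // q  # block size
    # Cut the matrices into q x q grids of b x b blocks
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Cb = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]
    # Preskewing: rotate row i of A left by i, and column j of B up by j
    Ab = [[Ab[i][(j + i) % q] for j in range(q)] for i in range(q)]
    Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]
    for _ in range(q):
        # Local computation on every "processor" (i, j)
        for i in range(q):
            for j in range(q):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]
        # Horizontal shift of A (left by one) and vertical shift of B (up by one)
        Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return np.block(Cb)

# Sanity check against the direct product
q, N = 4, 8
A, B = np.random.rand(N, N), np.random.rand(N, N)
assert np.allclose(cannon_multiply(A, B, q), A @ B)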


Execution Steps...
[Figure: block layouts after preskewing; local computation on proc (0,0); horizontal shift of A and vertical shift of B; local computation on proc (0,0) with the shifted blocks]

Fox's Algorithm (1987)
Originally developed for Caltech's Hypercube
Uses broadcasts and is also called the broadcast-multiply-roll algorithm
broadcasts the diagonals of matrix A
Uses a shift of matrix B
No preskewing step
[Figure: the first, second, third, ... diagonals of A]

Execution Steps...
[Figure: initial state; broadcast of A's 1st diagonal along the processor rows; local computation]

Execution Steps...
[Figure: vertical shift of B; broadcast of A's 2nd diagonal along the processor rows; local computation]

Fox's Algorithm

// No initial data movement
for k = 1 to q in parallel
    Broadcast A's kth diagonal
    Local C = C + A*B
    Vertical shift of B
// No final data movement

Note that there is an additional array to store the incoming diagonal block.
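As with Cannon, here is a minimal serial Python/NumPy sketch, not from the slides, that simulates Fox's broadcast-multiply-roll on a q x q block grid; the broadcast is modeled by every "processor" in row i using the same diagonal block of A at each step.

import numpy as np

def fox_multiply(A, B, q):
    """Simulate Fox's algorithm (broadcast-multiply-roll) on a q x q block grid."""
    N = A.shape[0]
    b = N // q  # block size
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Cb = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]
    for k in range(q):
        # "Broadcast" the kth diagonal of A: every processor in row i uses block A[i][(i+k) % q]
        for i in range(q):
            diag = Ab[i][(i + k) % q]
            for j in range(q):
                Cb[i][j] += diag @ Bb[i][j]
        # Vertical shift (roll) of B: every block moves up by one processor row
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return np.block(Cb)

q, N = 4, 8
A, B = np.random.rand(N, N), np.random.rand(N, N)
assert np.allclose(fox_multiply(A, B, q), A @ B)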


Snyder's Algorithm (1992)
More complex than Cannon's or Fox's
First transposes matrix B
Uses reduction operations (sums) on the rows of matrix C
Shifts matrix B
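A minimal serial Python/NumPy sketch, not from the slides, of the transpose / local-multiply / row-reduction / shift pattern; at step k the row sums produce the blocks C[i][(i+k) mod q].

import numpy as np

def snyder_multiply(A, B, q):
    """Simulate Snyder's algorithm on a q x q block grid."""
    N = A.shape[0]
    b = N // q  # block size
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Cb = [[None] * q for _ in range(q)]
    # Transpose B at the block level: "processor" (i, j) now holds block B[j][i]
    Tb = [[Bb[j][i] for j in range(q)] for i in range(q)]
    for k in range(q):
        for i in range(q):
            # Local multiply on every processor of row i, then global sum over the row:
            # sum_j A[i][j] @ B[j][(i+k) % q] = C[i][(i+k) % q]
            Cb[i][(i + k) % q] = sum(Ab[i][j] @ Tb[i][j] for j in range(q))
        # Vertical shift of the transposed B blocks: up by one processor row
        Tb = [[Tb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return np.block(Cb)

q, N = 4, 8
A, B = np.random.rand(N, N), np.random.rand(N, N)
assert np.allclose(snyder_multiply(A, B, q), A @ B)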

Execution Steps...
[Figure: initial state; transpose of B; local computation]

Execution Steps...
[Figure: shift of B; global sum on the rows of C; local computation]

Execution Steps...
[Figure: another shift of B; global sum on the rows of C; local computation]

Complexity Analysis
Very cumbersome
Two models:
4-port model: every processor can communicate with its 4 neighbors in one step
can match underlying architectures like the Intel Paragon
1-port model: only one single communication at a time for each processor
Both models assume bi-directional communication

One-port results
[Running-time expressions for Cannon, Fox, and Snyder under the one-port model]

Complexity Results
m in these expressions is the block size
Expressions for the 4-port model are MUCH more complicated
Remember that this is all for non-cyclic distributions
formulae and code become very complicated for a full-fledged implementation (nothing divides anything, nothing's a perfect square, etc.)
Performance analysis of real code is known to be hard
It is done in a few restricted cases
An interesting approach is to use simulation
Done in ScaLAPACK (Scalable Linear Algebra PACKage), for instance
Essentially: you have written a code so complex that you just run a simulation of it to figure out how fast it goes in different cases

So What?
Are we stuck with these rather cumbersome algorithms?
Fortunately, there is a much simpler algorithm that is
not as clever
about as good in practice anyway
That's the one you'll implement in your programming assignment

The Outer-Product Algorithm
Remember the sequential matrix multiply:

for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      Cij = Cij + Aik * Bkj

The first two loops are completely parallel, but the third one isn't
i.e., in shared memory, it would require a mutex to protect the writing of the shared variable Cij
One solution: view the algorithm as n sequential steps:

for k = 1 to n        // done in sequence
  for i = 1 to n      // done in parallel
    for j = 1 to n    // done in parallel
      Cij = Cij + Aik * Bkj
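The "outer-product" name comes from the fact that step k adds the outer product of A's kth column and B's kth row to C; a short NumPy illustration (not from the slides):

import numpy as np

n = 4
A, B = np.random.rand(n, n), np.random.rand(n, n)
C = np.zeros((n, n))
for k in range(n):
    # Step k adds the outer product of A's kth column and B's kth row
    C += np.outer(A[:, k], B[k, :])
assert np.allclose(C, A @ B)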

The Outer-Product Algorithm

for k = 1 to n        // done in sequence
  for i = 1 to n      // done in parallel
    for j = 1 to n    // done in parallel
      Cij = Cij + Aik * Bkj

During the kth step, the processor that owns Cij needs Aik and Bkj
Therefore, at the kth step, the kth column of A and the kth row of B must be broadcast over all processors
Let us assume a 2-D block distribution

2-D Block distribution
At each step, p - q processors receive a piece of the kth column of A and p - q processors receive a piece of the kth row of B (p = q^2 processors)

[Figure: 2-D block layouts of A, C, and B on the q x q processor grid, with the kth block column of A and the kth block row of B highlighted]

Outer-Product Algorithm
Once everybody has received its piece of the kth column of A and the kth row of B, everybody can add to the Cij's they are responsible for
And this is repeated n times
In your programming assignment:
Implement the outer-product algorithm
Do the theoretical performance analysis
with assumptions similar to the ones we have used in class so far
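A possible shape for such an implementation, sketched with mpi4py under several assumptions (p is a perfect square, the global size N is divisible by q, and local blocks are generated in place rather than read from input); this is an illustration, not the assignment solution.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
p = comm.Get_size()
q = int(round(p ** 0.5))            # processor grid is q x q (p assumed to be a perfect square)
my_row, my_col = divmod(comm.Get_rank(), q)

N = 8                               # global matrix size, assumed divisible by q
b = N // q                          # block size: each rank owns one b x b block of A, B, C

rng = np.random.default_rng(comm.Get_rank())
A_loc = rng.random((b, b))          # this rank's block of A
B_loc = rng.random((b, b))          # this rank's block of B
C_loc = np.zeros((b, b))

# Communicators along the processor rows and columns of the grid
row_comm = comm.Split(color=my_row, key=my_col)   # ranks in the same processor row
col_comm = comm.Split(color=my_col, key=my_row)   # ranks in the same processor column

for k in range(N):
    # The processor column owning global column k of A broadcasts its piece along each row
    owner_col = k // b
    a_piece = A_loc[:, k % b].copy() if my_col == owner_col else np.empty(b)
    row_comm.Bcast(a_piece, root=owner_col)
    # The processor row owning global row k of B broadcasts its piece along each column
    owner_row = k // b
    b_piece = B_loc[k % b, :].copy() if my_row == owner_row else np.empty(b)
    col_comm.Bcast(b_piece, root=owner_row)
    # Rank-1 update of the local block of C
    C_loc += np.outer(a_piece, b_piece)

The optimizations on the next slide apply directly to this sketch: broadcasting several columns/rows at a time instead of one, and overlapping the broadcasts with the rank-1 updates.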

Further Optimizations
Send blocks of rows/columns to avoid too many small transfers
What is the optimal granularity?
Overlap communication and computation by using asynchronous communication
How much can be gained?

    This is a simple and effective algorithm that is not too cumbersome


Cyclic 2-D distributions
What if I want to run on 6 processors?
It's not a perfect square
In practice, one makes the distribution cyclic to accommodate various numbers of processors
How do we do this in 2-D?
i.e., how do we do a 2-D block cyclic distribution?

The 2-D block cyclic distribution
Goal: try to have all the advantages of both the horizontal and the vertical 1-D block cyclic distributions
Works whichever way the computation progresses
left-to-right, top-to-bottom, wavefront, etc.
Consider a number of processors p = r * c, arranged in an r x c grid
Consider a 2-D matrix of size NxN
Consider a block size b (which divides N)

The 2-D block cyclic distribution
[Figure: an N x N matrix partitioned into b x b blocks, to be distributed over 6 processors P0..P5 arranged in an r x c processor grid]

The 2-D block cyclic distribution
[Figure: the same matrix with the top-left b x b blocks assigned to P0..P5, one block per processor]

The 2-D block cyclic distribution
The assignment of blocks to processors repeats cyclically in both dimensions; with a 2 x 3 processor grid the tiling is

P0 P1 P2 P0 P1 P2 ...
P3 P4 P5 P3 P4 P5 ...
P0 P1 P2 P0 P1 P2 ...
P3 P4 P5 P3 P4 P5 ...

Slight load imbalance
becomes negligible with many blocks
Index computations had better be implemented in separate functions
Also: functions that tell a process who its neighbors are
Overall, this requires a whole infrastructure, but many think you can't go wrong with this distribution
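Those "index computations in separate functions" boil down to a couple of small mapping helpers; a minimal Python sketch (function names are illustrative, not from the slides) for a block size b and an r x c processor grid:

def owner(i, j, b, r, c):
    """Processor (grid row, grid col) that owns global matrix entry (i, j)."""
    block_row, block_col = i // b, j // b
    return (block_row % r, block_col % c)

def local_index(i, j, b, r, c):
    """Position of global entry (i, j) inside the owning processor's local array."""
    li = (i // (b * r)) * b + i % b
    lj = (j // (b * c)) * b + j % b
    return (li, lj)

# Example with b = 2 and a 2 x 3 processor grid:
# global entry (5, 7) lies in block (2, 3), which is owned by processor (2 % 2, 3 % 3) = (0, 0)
assert owner(5, 7, b=2, r=2, c=3) == (0, 0)
assert local_index(5, 7, b=2, r=2, c=3) == (3, 3)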

Notes
By analogy with the regular pumping of blood by the heart, a systolic array is an arrangement of processors in an array (often rectangular) where data flows synchronously across the array between neighbours, usually with different data flowing in different directions.