
OPTIMIZING DENSE LINEAR ALGEBRA ALGORITHMS ON HETEROGENEOUS MACHINES∗

JORGE BARBOSA†, JOÃO TAVARES‡, AND A.J. PADILHA†

Abstract. This paper addresses the execution of inherently sequential linear algebra algorithms, namely LU factorization, tridiagonal reduction and the symmetric QR factorization algorithm used for eigenvector computation, which are significant building blocks for applications in our target image processing and analysis domain. These algorithms present additional difficulties in optimizing the processing time because the computational load of data matrix columns increases with their index, requiring a finely tuned load assignment and distribution. We present an efficient methodology to determine the optimal number of processors to be used in a computation, as well as a new static load distribution strategy that achieves better results than other algorithms developed for the same purpose.

Key words. heterogeneous computing, load balancing, static distribution, linear algebra algorithms, processing time optimization, distributed memory computer

1. Introduction. This paper addresses the problem of efficiently executing dense linear algebra algorithms on a cluster of personal computers. These algorithms have been extensively studied for homogeneous machines [10, 13, 16, 17]. For heterogeneous machines some studies have been carried out [7, 8], and the strategy adopted consists in the design of a static data distribution algorithm to adapt the parallel implementation of the linear algebra algorithms, as developed for homogeneous machines, to the heterogeneous paradigm. In these studies it was assumed that all available processors were used. For our target field, image analysis, this may not be the general case, so that it can sometimes be beneficial to keep some processors idle.

The new static distribution strategy presented herein, named Group Block, achieves better results than others [7, 8] in optimizing the execution on heterogeneous machines. As a byproduct we also present a methodology to compute the number of processors that optimizes the processing time of a given algorithm, for a given input data set, on the heterogeneous cluster, based on the computational model of the machine (section 4). From a pool of available heterogeneous nodes, it is determined which ones should be selected for building a parallel virtual computer that achieves the fastest application response time. The study considers that the nodes available prior to running the application may differ from time to time, as different users and machines are active. At every program initiation phase, the highest performance computers from the available set are selected, in a number that is computed to optimize the processing time.

Our aim is not to build a cluster of personal computers for parallel processing but rather to implement parallel processing on already existing group clusters, where each node is a desktop computer. These clusters are characterized by using a low cost network, such as Ethernet, connecting different types of processors, of variable processing capacity and amount of memory, thus forming a heterogeneous parallel virtual computer. Due to network restrictions, which do not allow simultaneous communication among several nodes, the application domain is restricted to one or two dozen processors.

The need for a methodology to determine the ideal number of processors also comes from network restrictions, since as the number of processors increases the network acts as a communication bottleneck and the time spent in data exchange can outweigh the benefits of more processing power. This fact is not usually addressed in the high performance clusters literature, due to the usually huge problem size; however, in [25] a scheduling policy is studied for multiprocessor systems based on the observation that some applications cannot exploit the computational power available, due to hardware and software constraints.

Our motivation for a parallel implementation of linear algebra algorithms comes from image and image sequence analysis needs, posed by various application domains, which are becoming

∗This work was supported by the PRAXIS XXI Program, Portugal.
†Instituto de Engenharia Biomédica, Faculdade de Engenharia da Universidade do Porto, DEEC, Rua Dr. Roberto Frias, 4200-465 Porto, Portugal (E-mail: [jbarbosa,padilha]@fe.up.pt).
‡Instituto de Engenharia Biomédica, Faculdade de Engenharia da Universidade do Porto, DEMEGI, Rua Dr. Roberto Frias, 4200-465 Porto, Portugal (E-mail: [email protected]).


increasingly more demanding in terms of the detail and variety of the expected analytic results, requiring the use of more sophisticated image and object models (e.g., physically-based deformable models), and of more complex algorithms, while the timing constraints are kept very stringent.

A promising approach to deal with the above requirements consists in developing parallel software to be executed, in a distributed manner, by the machines available in an existing computer network, taking advantage of the well-known fact that many of the computers are often idle for long periods of time [22].

Existing software, such as PVM and MPI, allows building parallel virtual computers by integrating into a common processing environment a set of distinct machines (nodes) connected to the network. Although the parallel virtual computer nodes and the underlying communication network were not designed for optimized parallel operation, very significant performance gains can be attained if the parallel application software is conceived for that specific environment.

The test cases presented, the LU factorization algorithm and components for eigenvector computation, while pertinent to many advanced image analysis methods, are also a common module in many other fields, such as simulation problems.

The implementation of the linear algebra algorithms follows the block parallelization strategy described in [10, 13]. Instead of using an already available implementation, a new one was developed in order to allow changing the distribution strategy without restrictions.

The remainder of the paper is organized as follows: In section 2 the characteristics of inherently sequential algorithms and the parallel implementation are introduced. Section 3 refers to the related work on implementing linear algebra algorithms on heterogeneous machines. Section 4 presents the computational model for the cluster, which is used to estimate the computation time of the algorithms. Section 5 presents the new static distribution algorithm, called Group Block, that guarantees the requirement of achieving a balanced load. Section 6 describes the methodology proposed to compute the optimum number of processors. Section 7 shows results of applying the optimization methodology to eigenvector computation.

2. Inherently Sequential Algorithms. This class of algorithms is characterized by the fact that the computation of a given data element requires the result obtained for previously processed elements, and there is a predefined pattern of computation. The LU factorization is used to present the characteristics of these algorithms.

2.1. LU factorization. The LU factorization algorithm is used to solve a system of equations Ax = v. The initial matrix A is replaced in memory by a lower triangular matrix (L) with ones on the diagonal and an upper triangular matrix (U). The diagonal of L is not stored since its elements are all 1. Figure 2.1 (left) shows the state of the matrix A at step i. First, a Gaussian elimination is performed on column i, yielding the final elements of row and column i. That data is then used to update the matrix A′ by a vector-vector multiplication. Once the elements of the result matrices L and U that replace A are computed, they are not visited again. This leads to two important characteristics of the algorithm: first, the input matrix at each step reduces in size as the computation progresses and, second, the number of flops required to compute each column depends on its position. Figure 2.1 (right) shows the flop count for each column for matrices of size (n, n) and n = {1200, 1600, 1800}. For a given column u, 2u(n − 1) + n − u² − 1 flops are required to obtain the final result, i.e. u updates and one partial pivoting.
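The per-column cost formula above is easy to check in code; summing it over all columns must recover the familiar overall LU cost (a minimal sketch; the function names are ours, not the paper's):

```python
def column_flops(u: int, n: int) -> int:
    """Flops needed to finalize column u of an (n, n) LU factorization:
    u vector updates plus one partial pivoting (Sec. 2.1)."""
    return 2 * u * (n - 1) + n - u * u - 1

def total_flops(n: int) -> int:
    """Sum of the per-column costs over the whole matrix."""
    return sum(column_flops(u, n) for u in range(n))
```

Summing the closed form gives n(n − 1)(4n + 1)/6, which approaches (2/3)n³ for large n, the LU operation count quoted later in section 4; the growth of `column_flops` with u is exactly the load imbalance that motivates the distribution strategy of section 5.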

2.2. Parallelization of the LU factorization. The parallel implementation of the LU algorithm for a distributed memory computer (cluster) consists in rewriting a state of the art sequential algorithm in order to extract its intrinsic parallelism. This approach is described in [9, 13] and consists in delaying the update of the sub-matrix A′ in order to optimize the performance of each node (a complete computer), although the same sequence of computations is followed.

3. Related Work. Static distribution strategies represent an application as a Directed Acyclic Graph (DAG). A node in the DAG represents a task and a directed edge represents the precedence among the nodes. To each node is assigned a weight representing its computational complexity. The scheduling consists in the distribution of the DAG nodes among the machine processors, so that the total computation time is minimized [21]. The general case of static task scheduling is NP-complete


Fig. 2.1. LU factorization algorithm: vector-oriented implementation (left) and flop count per column for matrices of size (n, n), n = {1200, 1600, 1800} (right).

and so various heuristics have been proposed for the homogeneous case [27]. The heterogeneous scheduling problem is more recent, but heuristics have been proposed for those platforms as well [6, 30].

For highly constrained algorithms such as those of linear algebra considered in this paper, the schedule is usually computed without explicit use of a DAG. The schedule is obtained by a data distribution algorithm that decides which data item goes to each processor. For the homogeneous case this is implemented by the well known block cyclic strategy [13], which is applied in most linear algebra packages such as LAPACK (www.netlib.org/lapack/).
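For reference, the homogeneous block cyclic mapping is trivial to state (a minimal illustration in Python, not taken from any particular package):

```python
def block_cyclic_owner(block_index: int, p: int) -> int:
    """Block cyclic mapping: block b goes to processor b mod p, so every
    processor keeps holding blocks as leading columns are eliminated."""
    return block_index % p

# owners of the first 8 column blocks on 3 equal processors
owners = [block_cyclic_owner(b, 3) for b in range(8)]
```

This round-robin spread is what the heterogeneous strategies below generalize: with unequal processors, equal shares are no longer balanced.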

For heterogeneous machines, a strategy based on duplicating the DAG critical path was presented in [20]. This strategy leads to a fine grain parallelization due to the high number of tasks generated to execute the algorithm, and therefore it is not suitable for our general purpose cluster environment.

An approach based on the definition of a static data distribution algorithm was carried out in [7, 8]. The requirement of minimizing the completion time is satisfied by a distribution algorithm that balances the load among the heterogeneous processors. This algorithm generates static distributions and uses a dynamic programming minimization algorithm to compute them. For performance comparison purposes we briefly review this approach next; we call it the Dynamic Programming Algorithm.

Dynamic Programming Algorithm: This algorithm was proposed to obtain the optimal static distribution in a heterogeneous machine for algorithms such as LU. The load balancing problem is solved in two phases: block assignment and distribution. The W data blocks of a given matrix are considered to be of equal size in amount of work and to have average execution times of t0, t1, . . . , tP−1 over the P processors, respectively.

The assignment is obtained by giving to each processor a load proportional to its speed: pi receives ci blocks so that ci · ti = constant. The distribution consists in defining the blocks that go to each processor. For that, a dynamic programming algorithm is used which, for j from 1 to W, gives block j to the processor k that satisfies tk · (ck + 1) = min{ti · (ci + 1)}, 0 ≤ i ≤ P − 1.

The algorithm implicitly assumes that all blocks have the same processing time ti on processor i, which is an average measure. Both assignment and distribution are based on these measures; therefore, this algorithm implements only a global control of the load distribution.
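The distribution step just described can be sketched directly from its min rule (the function name and the tie-breaking choice are ours; the summary above leaves ties unspecified):

```python
def dynamic_programming_distribution(t, W):
    """Distribution phase of the Dynamic Programming Algorithm as summarized
    above: block j goes to the processor k minimizing t_k * (c_k + 1), where
    c_k is the number of blocks k already holds. Ties break toward the lower
    processor index."""
    counts = [0] * len(t)
    order = []
    for _ in range(W):
        k = min(range(len(t)), key=lambda i: t[i] * (counts[i] + 1))
        counts[k] += 1
        order.append(k)
    return order, counts
```

For example, with per-block times t = [1.0, 2.0] and W = 6, the greedy rule hands processor 0 twice as many blocks as processor 1, matching the proportional assignment.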

The idea of leaving processors idle during a computation is not usually discussed in the parallel processing literature, mainly because problem sizes are usually huge relative to the available computing power. However, some authors consider this possibility. [25] presents a scheduling policy for multiprocessor systems based on the observation that some applications cannot exploit the computational power available, due to hardware and software constraints. That paper defines distribution policies modeled by Markov chains, and the global load balancing equation is solved to compute the optimum number of processors. In [5] an application oriented scheduling system was presented to determine which computers to use in an environment that is heterogeneous both in processing nodes and in the network.


In [18] it is shown, for a program simulating the electron flow in semiconductors, that the application could not usefully utilize more than an optimal number of processors, due to synchronization and other overheads that increase rapidly with the number of processors. This idea was explored as a methodology in [29], where the program performance is described by a full timing model.

Most of the proposed algorithms do not consider congestion of the network, as they assume that communications between different processors can occur in parallel. The methodology presented in this paper considers a shared network and includes mechanisms for contention awareness.

4. Computational Model. This section defines a computational model that mathematically describes the execution of algorithms on a heterogeneous cluster. It will be used to obtain the number of processors that minimizes the computation time.

Several computational models [11, 19, 31] have been presented to estimate the processing time of a parallel algorithm on a distributed memory computer. Although LogP [11], a general model for distributed memory architectures, could be adapted for the cluster of personal computers, a simplified model is defined here that represents the cluster architecture and the collision avoidance strategy.

The target machine is composed of nodes with different processing capacities, each characterized by its processing capacity S, measured in Mflop/s. The network is characterized by dividing a message into small packets that are sent sequentially, and by the bandwidth LB, measured in Mbit/s. Only one packet is transmitted at a time, and the sender has to gain access to the channel [28].

The computational model for the virtual machine, describing the behavior for a given algorithm, is obtained by summing the time spent in sequential operations TS and the time spent in parallel operations TP. Sequential operations include communications, data input/output and other processing that cannot occur in parallel due to each particular algorithm's characteristics. Parallel operations are those for which the time spent by one processor can be divided by p if p processors are used. The total processing time, as a function of the number of processors p and the problem size n, is given by equation 4.1.

TT(n, p) = TS(n, p) + TP(n, p)    (4.1)

The interconnection network is modeled by a temporal expression, TC, representing the time required to transmit a message of nb bits between two network nodes, assuming, for now, a contention-free network.

TC = K·TL + nb (1/LB + TE)    (4.2)

The latency time TL represents the time gap between the processor's order to transmit and the beginning of transmission, and TE represents the packing time. K is the number of packets into which a message is divided. The parameters TL and TE depend on the processor speed S.
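Equation 4.2 can be read directly as code (a sketch with illustrative parameter values of our own; units must be mutually consistent, and conversions between the paper's µs and Mbit/s are left to the caller):

```python
import math

def comm_time(n_bits: float, packet_bits: float, t_l: float,
              t_e: float, lb: float) -> float:
    """Eq. (4.2), contention-free: the message is split into K packets, each
    paying the latency t_l; every bit additionally pays bandwidth (1/lb)
    and packing (t_e) costs."""
    k = math.ceil(n_bits / packet_bits)  # K, number of packets
    return k * t_l + n_bits * (1.0 / lb + t_e)
```

Note the two regimes the model captures: short messages are dominated by the K·TL latency term, long ones by the per-bit bandwidth and packing term.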

Several experiments were conducted in order to measure these parameters for the network referred to in the results section, which is composed of the processors listed in Table 4.1. The values were measured for the matrix multiplication algorithm over different matrix sizes, resulting in the average values of Table 4.1.

S (Mflop/s)     244     161     60      50      49
TL (µs)         70      130     180     180     180
TE (µs/byte)    0.05    0.07    0.13    0.13    0.13

Table 4.1
Processor parameters

Although the Ethernet physically allows broadcasting, the PVM daemon converts a broadcast in a p processor machine into p − 1 messages [14]. Therefore, to correctly model a broadcast, the time spent in one message has to be multiplied by p − 1.


Independent communications over rows or columns, for either 1-D or 2-D grids, can cause network collisions, for example when all slave processes try to send results to the master process at the same time. To avoid collisions, a set of communication routines using a ring communication pattern was developed. They allow processes to synchronize by establishing the order of communications according to the processes' positions on the grid.

The parallel component TP of the computational model, equation 4.3, represents the operations that can be divided over a set of p processors obtaining a speedup of p, thus corresponding to operations without any sequential part.

TP(n, p) = ψ(n) / ∑i=1..p Si    (4.3)

The numerator ψ(n) is the cost function of the algorithm, measured in floating point operations (flops) as a function of the problem size n. For example, to compute the LU factorization of a square matrix of size n, the cost is ψ(n) = (2/3)n³ [12]. This operation count does not include memory operations, so the resulting complexity is higher.

The computational complexity for each algorithm is, therefore, obtained by measuring the processing capacity achieved by the processors used. Figure 7.1 shows the capacity obtained by a 161 Mflop/s computer. It can be observed that the performance is kept almost constant over a range of matrix sizes, which is a requirement for being able to estimate the processing time of the algorithms. These measures were made for several block sizes from 10 to 40, resulting in a maximum variation of the processing time below 6%. From this point on, the coefficient of ψ(n) will be referred to as the algorithm constant β, defined as β = S/Seq, where S is the processor peak performance and Seq is the averaged measure of processing capacity. The value β does not depend on the processor peak performance but rather is a characteristic of the algorithm.

The denominator of equation 4.3 is the processing capacity used, which is obtained by summing the individual processing capacities of the machines. For this equation to be valid, each machine should not take more than TP seconds to process its part. This assumes a perfect load balancing in the heterogeneous machine.
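Under this model, selecting the fastest available nodes and scanning p is enough to locate the minimum of equation 4.1 (an illustrative sketch of the idea only, not the paper's exact procedure of section 6; `seq_time` stands for whatever TS(n, p) model the algorithm dictates, and the numeric values below are invented):

```python
def parallel_time(psi: float, speeds) -> float:
    """Eq. (4.3): perfectly balanced parallel work psi (flops) divided by
    the aggregate capacity of the chosen processors."""
    return psi / sum(speeds)

def best_processor_count(psi: float, speeds, seq_time) -> int:
    """Pick the p minimizing T_S(p) + T_P(n, p), taking the p fastest nodes
    from the pool, as the paper's node-selection policy prescribes."""
    speeds = sorted(speeds, reverse=True)  # fastest first
    return min(range(1, len(speeds) + 1),
               key=lambda p: seq_time(p) + parallel_time(psi, speeds[:p]))
```

With a sequential part that grows with p (as broadcasts modeled by p − 1 messages do), the optimum can fall strictly below the pool size, which is exactly the motivation given in the introduction for leaving processors idle.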

5. Load Balancing Strategy. This section presents the new static distribution algorithm, called Group Block, that guarantees the requirement of achieving a balanced load. The load balancing strategy can be either dynamic or static.

Dynamic strategies balance the workload at run-time by transferring load from more heavily loaded processors to more lightly loaded ones. These algorithms require a considerable amount of control data, being suitable for programs that can be divided into several loosely-coupled tasks [32], or for programs whose workload cannot be predicted at compile time. Dynamic strategies for dense linear algebra algorithms have been addressed in [7, 8], where it was concluded that they cannot achieve the performance of static strategies due to the dependence constraints among tasks, communication costs and control overhead.

Static strategies define a schedule at compile or at launch time based on knowledge of the algorithm structure and of the processor availability. The aim is to define a distribution that guarantees an evenly balanced load throughout the computation, such that no processor is kept idle due to data dependencies.

The computation environment, or system, is intrinsically dynamic, since during a given computation period some processors can become unavailable while others may join the pool of available processors. Therefore, it is important to consider separately the system strategy to deal with this dynamic behavior, which may lead to a redistribution of load, from the algorithm strategy for load balancing, which depends on the algorithm structure. Any combination of both strategies is possible. The work presented in this section refers to the strategy of achieving a good load balancing for inherently sequential algorithms, and therefore redistribution of data due to machine dynamics is considered to be accomplished at a higher level of control.

The distribution algorithm presented next, the Group Block Distribution [3], is a static algorithm that optimizes the processing time on heterogeneous machines by achieving a better load distribution, that is, by improving the static part of the load balancing strategy.


5.1. Static distribution algorithms for a heterogeneous machine. For a regular data set, the data distribution problem in a heterogeneous machine can be stated as: how to assign W data blocks over P processors, having Si Mflop/s (0 ≤ i ≤ P − 1) of processing capacity, in order to obtain the minimum processing time. Assuming that processor i takes ti time units to process one block, the optimal load balancing strategy is such that the number of blocks assigned to processor i (wi) is governed by keeping wi · ti constant, and the W blocks are distributed in order to involve all processors equally in each step of the algorithm.

The load balancing strategy is divided in two phases: data assignment and data distribution. The first phase guarantees that every processor receives the same amount of "computation time", and the second guarantees that processors' idle times are minimized.

A column-oriented partition and the LU factorization algorithm will be used in the following discussion. In [2] it was concluded that a column-oriented partition optimizes the processing time of LU when computed on the target computer, a parallel virtual machine of personal computers connected via Ethernet.

5.2. Group block algorithm. The algorithm presented here aims to improve the processing time of the LU algorithm by a global and local control of the load. The algorithm is divided in two phases: block assignment and distribution.

5.2.1. Block assignment. The amount of blocks assigned to each processor is obtained based on the average processing time of a column for each processor. This is a valid option if the next phase, the distribution, guarantees that all processors have columns throughout the matrix on an equal basis.

The number of blocks assigned to each processor is proportional to its processing capacity relative to the entire machine. The load index is defined as li = Si / ∑k=0..P−1 Sk (0 ≤ i ≤ P − 1), where Sk is the processing capacity achieved for the algorithm by node k, measured in flops. Thus, for a W block matrix, the number of blocks assigned to processor i is wi = W · li.

This results in an optimal assignment only if the wi are integer values, which is seldom the case. To overcome this problem, wi is defined as wi = ⌊W · li⌋. As a consequence there are some remaining blocks, fewer than P, which are assigned successively to the processor that ensures the minimum computation time on average.
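The assignment phase can be sketched as follows (handing the leftover blocks to the fastest processors is our own reading of "the processor that ensures the minimum computation time on average"; with the speeds of Table 5.1 and W = 72 it reproduces the block counts {27, 17, 17, 6, 5} shown there):

```python
def block_assignment(speeds, W):
    """Group Block assignment phase (subsection 5.2.1): processor i gets
    w_i = floor(W * l_i) blocks, l_i being its share of the total capacity.
    The fewer-than-P leftover blocks go one each to the fastest processors."""
    total = sum(speeds)
    w = [W * s // total for s in speeds]          # floor(W * l_i)
    leftovers = W - sum(w)
    # one extra block to each of the `leftovers` fastest processors
    for i in sorted(range(len(speeds)), key=lambda i: -speeds[i])[:leftovers]:
        w[i] += 1
    return w
```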

5.2.2. Block distribution. The block distribution operation consists in defining the order by which the W data blocks are distributed to the processors, in a total of wi (0 ≤ i ≤ P − 1) per processor. To accomplish the block assignment requirement stated in the first paragraph of subsection 5.2.1, the blocks are organized in groups. The number of blocks per group should be small in order to minimize the differences in the computational complexity of blocks, so that the work assigned to processors is independent of the column index inside the group. The number of blocks per group is defined as GB = ⌊1/min(li)⌋ (0 ≤ i ≤ P − 1). If GB/P < 2 then GB = ⌊2/min(li)⌋ for a processor grid (1, P) or (P, 1). The minimum number of blocks is obtained when the slowest processor receives 1 block, or 2 if GB/P < 2. This condition is imposed to ensure that there is a sufficient number of blocks in the group.

Computer    S (Mflops)    Index li    Block count    Data assigned
P0          244           0.3615      27             1800×675
P1          161           0.2385      17             1800×425
P2          161           0.2385      17             1800×425
P3          60            0.0889      6              1800×150
P4          49            0.0726      5              1800×125

Table 5.1
Data partition for a heterogeneous machine; (1800, 1800) matrix and a block size of 25


For the example of Table 5.1, GB = ⌊1/0.0726⌋ = 13. Using the indices li, the 13 blocks are assigned as {5, 3, 3, 1, 1} to processors 0 to 4, respectively. The W data blocks are distributed in groups of 13. Inside each group, the distribution can be {0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 4}, that is: the first 5 blocks to processor 0, the next 3 blocks to processor 1, and so on. For algorithms such as LU factorization, only blocks below the pivot are updated. The global load balancing is guaranteed by the distribution in groups; however, for the group that holds the pivot it is not possible to balance the workload due to the lack of data. Supposing that the last group is being processed, with that distribution the fastest processor would be idle after the fifth block; therefore, it is possible to reduce the processing time if the last blocks are assigned to the fastest processors, that is, when there is not enough data to balance the workload, it should be the fastest processors doing the work. The load imbalance mentioned above happens in every group when it holds the pivot block; therefore, the improvement can take effect from the beginning.

The proposed distribution, coined Group Block, implements this strategy by overloading the fastest processors inside each group. For the case presented, a group of 13 blocks, a possible assignment would be {4, 3, 2, 2, 2, 1, 1, 1, 0, 0, 0, 0, 0}, where the numbers represent the processor index on a 1×5 grid (indices 0 to 4). This is a Group Block linear distribution that starts the distribution from the slowest processors.
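The in-group split and the linear ordering can be sketched as follows (a sketch under our own assumptions: leftover blocks inside a group go to the processors with the largest fractional remainders, which reproduces the {5, 3, 3, 1, 1} split above; the GB/P < 2 refinement is omitted; processors 1 and 2 have equal speed, so their mutual order may swap relative to the listing above):

```python
def group_counts(l, gb):
    """Split a group of gb blocks among processors with load indices l:
    floor(gb * l_i) each, leftovers by largest fractional remainder."""
    c = [int(gb * li) for li in l]
    rest = gb - sum(c)
    for i in sorted(range(len(l)), key=lambda i: gb * l[i] - c[i],
                    reverse=True)[:rest]:
        c[i] += 1
    return c

def group_block_linear(l):
    """One group of the Group Block linear distribution (subsection 5.2.2):
    slowest processors listed first, so the fastest finish the group."""
    gb = int(1 / min(l))                           # GB = floor(1 / min(l_i))
    c = group_counts(l, gb)
    order = []
    for i in sorted(range(len(l)), key=lambda i: l[i]):   # slowest first
        order += [i] * c[i]
    return order
```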

For bi-dimensional grids, if the relative processor capacities do not allow a balanced configuration on the grid, an imbalanced distribution can arise. Other data partitioning schemes were proposed in [4] for the matrix multiplication (MM) algorithm. Contrary to the MM algorithm, for inherently sequential algorithms the amount of data transferred increases if the sequential part of the operation is distributed, as would be the case for the pivoting operation. As shown in the results section, these algorithms always have a one-dimensional grid as the optimal configuration, either a row or a column of processors, and therefore this issue is not discussed further.

5.3. Comparison of static distribution algorithms. The results below present the computation of the LU factorization on the machine of Table 5.1, connected by Ethernet at 100 Mbit/s, running Linux and PVM 3.4.3. The processing capacity S is the processor peak performance measured in Mflop/s. Table 5.2 shows the assignment obtained with the algorithm described above and with the dynamic programming algorithm described in section 3. Both algorithms assign the same number of blocks to each processor.

Algorithm          Order of distribution (processor index from 0 to P − 1)
Group Block        4 3 2 2 2 1 1 1 0 0 0 0 0   4 3 2 2 2 1 1 1 0 0 0 0 0   4 3 2 2 2 1 1 1 0 0 0 0 0
(linear)           4 3 2 2 2 1 1 1 0 0 0 0 0   4 3 2 2 2 1 1 1 0 0 0 0 0   3 2 2 1 1 0 0
Dyn. Prog. Alg.    0 0 2 1 0 4 3 2 1 0 0 2 1 0 2 1 0 3 0 4 2 1 0 2 1 0 0 2 1 3 0 2 1 0 4 0
                   2 1 0 3 2 1 0 0 2 1 0 4 2 1 0 3 0 2 1 0 2 1 0 0 4 2 1 3 0 2 1 0 0 2 1 0

Table 5.2
Data block distribution algorithms for a column-oriented partitioning and a matrix of size (1800, 1800) with a block size of 25

Figure 5.1 (left) compares the performance obtained with the two distributions. The Group Block distribution obtains an improvement in all cases shown and, for the 1800 matrix, the improvement is 8.3% when compared to the dynamic programming algorithm.

Figure 5.1 (right) shows the workload distribution. It can be seen that for the dynamic programming algorithm the differences between processors are smaller than for the Group Block distribution, although the latter achieves better performance. This means that a controlled imbalance that overloads the fastest processors improves performance. The reason for this behavior is the local control implemented by the Group Block distribution, whereas the other strategy bases the distribution only on global control.

In conclusion, for algorithms with the same pattern of computation as LU factorization, the Group Block algorithm gives better results than the dynamic programming algorithm because it implements a global control as well as a local control, which minimizes the effects of the different computational complexities among columns and also reduces the effect of the lack of data in some stages of the processing. Both algorithms assign the same number of blocks to each processor, the performance improvement being the result of the distribution of the blocks.

[Figure 5.1: left, processing time (s), including communications, versus matrix dimension (1000 to 1800), for the dynamic programming algorithm and the Group Block distribution; right, workload per processor (P0 to P4), excluding communications.]

Fig. 5.1. LU factorization results

6. Optimization of the Processing Time. This section presents the method to determine the ideal number of processors for a given problem, based on the computational model described in section 4. A parallel version of an algorithm may have two aims: to obtain better accuracy of results by using a more detailed domain, which may not be possible in a single processor, usually due to memory limitations; or, for a given accuracy, to obtain a reduction in the processing time. The time gain obtained with the parallel algorithm is usually called Speedup and is defined as the quotient of the serial algorithm time (T1) over the parallel algorithm time (TT).

Speedup = T1 / TT    (6.1)

Depending on how the serial processing time is measured, one can have different definitions of Speedup: Relative Speedup, Real Speedup or Absolute Speedup [26]. In the context of the applications envisaged for the parallel virtual machine, we define Speedup as the ratio of the processing time of the serial version on the computer that controls the parallel execution (the master) to the processing time of the parallel program. This is the effective gain as seen by the user, who has a choice between his/her own single machine (the master) and the parallel virtual machine. The aim in scheduling work for distributed processing is to obtain a processing time that is as small as can be obtained for that particular network, even if some of the nodes are left idle. Therefore, the relevant parameter to be considered is the Speedup rather than the Efficiency, which is often used in other contexts.

The goal of the work in this section is the determination of the optimum number of processorsusing a criterion of minimum processing time.

6.1. Application to a homogeneous machine. For a given algorithm, characterized by the constant β, and for size-n matrices, p can be obtained by solving equation 6.2 with respect to p [1].

∂TT/∂p = 0    (6.2)

For a homogeneous machine, equation 4.3 simplifies to TP(n, p) = ψ(n)/(pS), and the communication parameters TL and TE assume the same value for all machines, allowing a straightforward solution.

6.2. Application to a heterogeneous machine. For a heterogeneous machine another degree of complexity is added to equation 6.2: first, processors have different computational capacities (S) and, second, the communication parameters TL and TE also vary with S, as shown in Table 4.1.

To tackle this problem, one first orders the nodes by decreasing value of Si (the capacity of node i) and then schedules the work from the fastest to the slowest free node, resulting in the denominator of equation 4.3: ST(p) = ∑(i=1..p) Si. To compute the first derivative of TT with respect to p, it is required to find the sum ST(p), which cannot be computed beforehand, since one does not know how many processors will be used. The function ST(p) increases monotonically with p, with a growth rate that decreases as p increases, as shown in figure 6.1 for a machine composed of processors of capacities {244, 244, 161, 161, 60, 50, 49} Mflop/s, in decreasing order.

[Figure 6.1: processing capacity (Mflop/s) versus number of processors (1 to 7) for M = {244, 244, 161, 161, 60, 50, 49}, showing the real curve Sreal and the quadratic approximations Squad (first and second iterations).]

Fig. 6.1. Processing capacity of the heterogeneous machine as a function of the processors used

The aim is to approximate ST(p) (or Sreal) by a polynomial function in p in order to be able to solve equation 6.2. A first-order polynomial function, as used for a homogeneous machine, is not adequate here. The ideal polynomial function would be one that passes through all points of ST(p); however, its computation time may be significant for a large number of processors. The solution adopted was an iterative quadratic approximation. The first function is defined by zero and the extreme points of ST(p). The iterative process allows the reevaluation of the cost function TT(n, p) in the neighborhood of the computed solution. In each iteration only half of the processors used in the last iteration are considered, the polynomial function being defined as follows: if P is the total number of processors and p(i−1) the solution for iteration (i − 1), then in iteration i the function is defined by the three points ST(p(i−1)) and ST(p(i−1) ± P/2^(i+1)), for i = 1, 2, . . ..

The second-degree polynomial function has the same behavior as ST(p) and is written as

PS(p) = ap² + bp + d    (6.3)

resulting in the first derivative of TP(n, p) with respect to p:

∂TP(n, p)/∂p = ∂/∂p [ ψ(n) / (ap² + bp + d) ] = 0    (6.4)

which must be solved in order to obtain the number of processors p that minimizes the total processing time.
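The first iteration of the quadratic approximation described above can be sketched with NumPy (the helper name `quad_through` is ours). Anchoring the parabola at zero and at the extreme points ST(1) and ST(P) of the machine in figure 6.1 reproduces the coefficients PS(rc) = −17.595(rc)² + 261.6rc used later in section 6.3.

```python
import numpy as np

# Cumulative capacity ST(p) for the machine of figure 6.1, in Mflop/s.
speeds = [244, 244, 161, 161, 60, 50, 49]
ST = np.cumsum(speeds)            # ST[p-1] = capacity with the p fastest nodes

def quad_through(points):
    """Fit PS(p) = a*p**2 + b*p + d exactly through three (p, PS) points."""
    A = np.array([[p * p, p, 1.0] for p, _ in points])
    y = np.array([s for _, s in points], dtype=float)
    return np.linalg.solve(A, y)  # returns (a, b, d)

# First iteration: anchor at zero and at the extreme points ST(1), ST(P).
P = len(speeds)
a, b, d = quad_through([(0, 0.0), (1, ST[0]), (P, ST[-1])])
print(round(a, 3), round(b, 1))  # → -17.595 261.6
```

In a subsequent iteration the three anchor points would move to ST(p(0)) and ST(p(0) ± P/4), tightening the fit in the neighborhood of the previous solution.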

If the logical grid of processors affects the processing time, then changing to a 2D grid (e.g. an (r, c) grid) or to a 3D grid (e.g. a hypercube) adds one or two dimensions to the problem, respectively. For the 2D grid the quadratic approximation, with p = rc, becomes PS(rc) = a(rc)² + b(rc) + d.

The communication parameters TL and TE also need to be modeled as functions of p in order to solve ∂TS(n, p)/∂p = 0. To transmit a message from computer A to B, the latency and packing time depend on the speed of processor A. If one can predict the amount of data each processor will be responsible for transmitting, one can estimate the time spent in communications by the whole machine. According to the data distribution algorithm, each processor is allocated an amount of data proportional to its relative speed in the heterogeneous machine: li = Si / ∑(k=1..p) Sk. Therefore, the functions that model these parameters are defined by equation 6.5, corresponding to a weighted mean of these values for each possible number p of processors. The values of (TL)i and (TE)i are shown in Table 4.1.

TTL(p) = ∑(i=1..p) li × (TL)i   and   TTE(p) = ∑(i=1..p) li × (TE)i    (6.5)


For the machine considered (figure 6.1), the functions TTL(p) and TTE(p) are shown in figure 6.2 (left and right, respectively). Those figures also show a first-degree polynomial approximation to be included in TS(n, r, c).

[Figure 6.2: left, latency time TTL(p) per byte (µs/byte) versus number of processors, with a first-degree (linear) approximation; right, packing time TTE(p) per byte (µs/byte), also with a linear approximation.]

Fig. 6.2. Approximation functions for communication parameters

The (r, c) configuration that minimizes the processing time is obtained from ∇TT(n, r, c) = 0. Since one wants to compute the ideal grid (r, c) for a given problem size n, the first derivative of TT(n, r, c) with respect to n is zero. Thus, the optimal configuration is obtained by solving the system of equations 6.6.

∂TT(n, r, c)/∂r = 0
∂TT(n, r, c)/∂c = 0    (6.6)

6.3. Applying the methodology to the LU factorization. The methodology presented above will be tested with the parallel implementation of the LU factorization described in section 2.2.

Considering an (r, c) grid of processors and a square matrix of size (n, n), the amount of data (double/float) transmitted in the parallel matrix update is (r + c − 2)n²/2, and the parallel processing time is TP(n, r, c) = βLU n³/ST(rc) + Θ(n²) ≃ βLU n³/ST(rc), where βLU = 7. There is a negligible component of complexity n² corresponding to the computations made by the pivot processor. ST(rc) is the processing capacity of the heterogeneous machine when rc processors are used. The total estimated processing time is expressed as

TT(n, r, c) = ( (n²/2)(r + c − 2) / packetsize ) TTL(rc) + (n²/2)(r + c − 2)(LB⁻¹ + TTE(rc)) + β n³/PS(rc).

Depending on the data type used (float or double), the corresponding communication factors have to represent the amount of data in bits. LB is the bandwidth measured in Mbit/s.

For the machine of figure 6.1 the quadratic approximation, equation 6.3, becomes PS(rc) = −17.595(rc)² + 261.6rc. This approximation is close to the real curve ST(rc) for values 0 ≤ rc ≤ 7. Outside this domain the polynomial function may introduce false minima in the processing time function. Therefore, the minimization must be restricted to the domain allowed by the number of processors available. This can be accomplished by introducing Lagrange multipliers [23] into the system of equations 6.6. An additional function to restrict the domain is included:

∂TS(n, r, c)/∂r + ∂TP(n, r, c)/∂r = −λc
∂TS(n, r, c)/∂c + ∂TP(n, r, c)/∂c = −λr
λ(rc − 7) = 0    (6.7)

Figure 6.3 (left) shows the processing time estimated and measured for grids (1, c), for a matrix of size 1800, limited to 7 processors. In the figure, Tquad is the total processing time TT(n, r, c) obtained by estimation with the quadratic approximation for the machine processing capacity, Treal is obtained by estimation using the exact processing capacity, and Tmeas is the measured processing time. Treal models the measured time (Tmeas) correctly, with a maximum relative error of 2.6% (c = 6), meaning that the computational model is accurate. In both cases it can be observed that the ideal grid is (1, 4). However, the optimal grid obtained with Tquad is (1, 5), which differs from the measured value. A second iteration is required in order to obtain the correct solution. In this iteration only half the processors are considered (3), the quadratic approximation being centered on the former solution (c = 5), as shown in figure 6.1. Since fewer processors are considered, the quadratic approximation is closer to the real curve. Solving the system of equations 6.7 again, the correct solution c = 4 is obtained.

The iterative process allows a reevaluation of the processing time function using, from iteration to iteration, a more accurate quadratic approximation, which leads to the optimal solution.

[Figure 6.3: left, processing time (s) for a matrix of size 1800 versus columns of processors (2 to 7), comparing Tquad, Treal and Tmeas; right, estimated (E) and measured (M) communication times (s) for matrices of size 1200 and 1800.]

Fig. 6.3. Comparison of estimated and measured times for LU factorization

Figure 6.3 (right) shows the estimated (E) and measured (M) communication times for matrices of size 1200 and 1800. In general the communications are satisfactorily modeled. These times are included in the total processing time of figure 6.3 (left).

7. Results. In this section results are presented for tridiagonal reduction (TRD), QR factorization and the symmetric eigenvector algorithm, on the heterogeneous machine represented in figure 6.1, connected by an Ethernet network at 100 Mbit/s. Figure 7.1 shows the performance of each algorithm on a single processor. The QR performance is divided by 2 for display purposes. As shown before for the LU algorithm, the processor performance is kept almost constant for the block versions of these algorithms, for matrices larger than 400 elements. The corresponding β value considered for each algorithm is the average in that domain. The square block size used in all cases varies from 16 to 40.

The aim is to show that, using the system of equations 6.7 with the quadratic approximation for the processing capacity, one is able to estimate the ideal grid of processors for a given algorithm and problem dimension. In all cases the Group Block distribution algorithm is applied to guarantee the required load balance. The estimated values presented below, Tquad and Treal, are obtained using the quadratic approximation and the real values of processing capacity, respectively. Tmeas represents the measured processing times.

[Figure 7.1: performance (Mflop/s) versus matrix dimension n (0 to 1400) for the LU, TRD and QR/2 block algorithms.]

Fig. 7.1. Performance of LU, QR and TRD block algorithms on a 161 Mflop/s peak performance processor as a function of matrix size


7.1. Tridiagonal reduction algorithm. The tridiagonal reduction algorithm (TRD) is a step in the computation of the eigenvalues and eigenvectors of a symmetric matrix. Algorithm details can be found in [10]. For an (r, c) processor grid, the amount of data to transmit is 2n(r − 1) + 4n²(rc − 1), for the computation and broadcast of Householder vectors, the parallel matrix update and the matrix-vector products. The parallel processing time is TP(n, r, c) = βTRD n³/ST(rc) + Θ(n²), where βTRD = 28; there is also a negligible term in n². Figure 7.2 (left) shows the processing time for a matrix of size 1200. Treal models the measured processing time (Tmeas) correctly, with a maximum error of 3.3% for c = 4. Tquad presents the same behavior as Treal but higher errors for c up to 4, which coincides with the behavior of the quadratic approximation. The minimum is correctly determined as grid (1, 4) in the first iteration of the optimization algorithm.

[Figure 7.2: left, processing time (s) for a matrix of size 1200 versus columns of processors (1 to 7), comparing Tquad, Treal and Tmeas; right, estimated (E) and measured (M) communication times (s) for matrices of size 800 and 1200.]

Fig. 7.2. Comparison of estimated and measured times for the Tridiagonal Reduction algorithm

Figure 7.2 (right) shows the estimated (E) and measured (M) communication times for matrices of size 800 and 1200. The most significant differences are observed for the matrix of size 1200, where communications are overestimated. In all cases the difference is below 2 seconds, which is not significant in the total processing time (figure 7.2 (left)).

7.2. QR iteration algorithm. The QR iteration [15] is the last step in the eigenvector computation sequence, preceded by the tridiagonal reduction of a symmetric matrix and the orthogonal matrix computation.

The parallelization implemented takes advantage of the fact that one rotation updates only two columns without inter-row dependencies, and therefore a row-oriented distribution was adopted.

The QR iteration has two computational tasks: one, to do the bulge chase, of order n², and the other, to update the orthogonal matrix, of order n³. The parallel processing time is TP(n, r, c) = βQR n³/ST(rc) + Θ(n²), where βQR = 57. The time to compute the chases is in fact negligible when compared to the Θ(n³) term.

Figure 7.3 (left) shows the estimated and measured processing time for a matrix of size 1000. Treal models Tmeas correctly, with a maximum relative error of 2.9% for r = 4. Tquad overestimates the processing time up to r = 4, being more accurate for a higher number of processor rows. This behavior is expected, since the quadratic approximation has a higher error for r up to 4. The optimal grid (1, 7) is obtained in the first iteration.

The only communications involved are those that distribute the Givens rotations, estimated, assuming a convergence rate γ, as γn²(r − 1). This is an estimate because the number of chases depends on the rate of convergence of the QR iteration; this rate is expected to be less than 2 [12]. The estimated values of figure 7.3 (right) were computed with γ = 0.9, obtained experimentally for the matrix used. In this algorithm the communication parameters TE and TL refer to the machine that computes the Givens rotations, since it is the only emitter in the QR iteration.

7.3. Symmetric eigenvector computation. In this subsection the whole eigenvector computation algorithm, executed on the heterogeneous machine, is compared to a serial version [24] executed on the fastest node.


[Figure 7.3: left, processing time (s) for a matrix of size 1000 versus rows of processors (1 to 7), comparing Tquad, Treal and Tmeas; right, estimated (E) and measured (M) communication times (s) for matrices of size 1000 and 1600.]

Fig. 7.3. Comparison of estimated and measured times for QR iteration

The performance metrics used to evaluate the parallel application are, first, the runtime and, second, the speedup achieved. To have a fair comparison in terms of speedup, one defines the Equivalent Machine Number (EMN(p)), which considers the power available instead of the number of machines, since the latter, in a heterogeneous environment, constitutes ambiguous information. Equation 7.1 defines EMN(p) for p processors used, where S1 is the computational capacity of the processor that executed the serial code, also called the master processor.

EMN(p) = ( ∑(i=1..p) Si ) / S1    (7.1)

For the machine presented in figure 6.1, EMN(6) = 3.77 and EMN(7) = 3.97, which means that 6 processors of the heterogeneous machine are equivalent to 3.77 processors identical to the master processor, and that 7 processors are equivalent to 3.97.

Figure 7.4 (left) and figure 7.4 (right) compare the virtual machine to the fastest node of the machine, which was used to run the sequential code. Different grid configurations are used for each algorithm, according to the optimal grid computed by equation 6.7.

It is observed that for matrices of size 600 or larger, the heterogeneous efficiency (EH) is better than 45%. For a matrix of size 800, EH = 61% with 3.8 equivalent processors. In [16] an efficiency of 57.9% was obtained with 4 vector processors of the distributed memory computer Fujitsu VPP300, using the same algorithm for the eigenvector computation of a matrix of size 802.

In terms of parallelization quality, the results displayed are in the range of values obtained with dedicated machines. However, in terms of processing time the virtual machine cannot be compared to a high performance parallel computer: for the above experiment the virtual machine took 97.7 seconds and the Fujitsu took 1.783 seconds.

Fig. 7.4. Results of eigenvector computation for a heterogeneous machine; left: eigenvector computation in a 7-processor heterogeneous machine compared to the sequential algorithm executed in the fastest node; right: grid selection, speedup and heterogeneous efficiency


8. Conclusions. This paper presented a methodology to run dense linear algebra algorithms efficiently on Ethernet-based clusters. Although only a few algorithms were considered, namely LU factorization and the symmetric eigenvector algorithm, they are applied extensively in many application domains.

The methodology is based on a static distribution algorithm that schedules the work so that the fastest processors are preferred when there is not enough data to process, which is a characteristic of inherently sequential algorithms. The number of processors to use in a computation is obtained by an optimization algorithm based on the computational model of the heterogeneous cluster. This methodology proved to be valuable for environments where the network is a bottleneck of the system, such as a bus-oriented architecture. The computation of the optimum number of processors can also be useful in bus-oriented homogeneous architectures.

The methodology presented in this paper was designed to address problems arising in the context of using image processing and analysis algorithms for interactively extracting important data and information from images of a specific application domain, namely medical imaging.

Currently, this activity is often conducted by exploring the functionality (hardware and software) of general purpose systems, which usually trade off algorithm sophistication against user comfort; this means that more advanced image tools may be absent from these systems due to practical considerations.

The main goal of the work herein presented was to take advantage of the existence of a network of computers (a very frequent situation in many user organizations) to try to move the aforementioned trade-off in the direction of allowing the provision of more advanced and sophisticated algorithms without sacrificing user comfort.

The results presented show that, for the important linear algebra building blocks of many advanced image analysis methods, the stated goal may be accomplished; an improvement by a factor of about 3 has been achieved in the execution time, which may bring more image analysis tools within reach of new general-purpose software.

A collection of machines with a wide range of processing capacities, from 244 down to 49 Mflop/s in the case presented, can cooperate and achieve a considerable speedup in linear algebra algorithms. The main objective is that the user of a computationally demanding application may benefit from the computational power distributed over the network, while keeping other active users undisturbed. This goal can be achieved in a manner transparent to the user, once the modules of his/her application are correctly parallelized for the target network and the performance of the machines in the network is known. The application, before initiating a parallel module, determines the best available computer composition for a parallel virtual computer to execute it, and then launches the module, achieving the best response time possible under the actual network conditions.

Practical tests of the methodology were conducted on heterogeneous networks, using basic algorithms from linear algebra; the theoretical values computed were closely confirmed by the measured performance. Other generic modules will be parallelized and tested, so that an ever increasing number of image analysis methods may be assembled from them. Application domains other than image analysis may also benefit from the proposed methodology.

REFERENCES

[1] J. Barbosa and A. Padilha, Algorithm-dependent method to determine the optimal number of computers in parallel virtual machines, VECPAR'98, vol. 1573, Springer-Verlag, 1998, 508-521.
[2] J. Barbosa, J. Tavares, and A.J. Padilha, Linear algebra algorithms in a heterogeneous cluster of personal computers, Proceedings of 9th Heterogeneous Computing Workshop, IEEE CS Press, May 2000, 147-159.
[3] A group block distribution strategy for a heterogeneous machine, Proceedings of AI Int. Symposium on Parallel and Distributed Computing and Networks, IASTED, February 2002, 378-383.
[4] O. Beaumont, V. Boudet, F. Rastello, and Y. Robert, Matrix multiplication on heterogeneous platforms, IEEE Transactions on Parallel and Distributed Systems, 12, (2001), 1033-1051.
[5] F. Berman, R. Wolski, S. Figueira, J. Schopf, and G. Shao, Application-level scheduling on distributed heterogeneous networks, Supercomputing '96, 1996.
[6] V. Boudet, Heterogeneous task scheduling: a survey, Tech. Report 2001-07, LIP-Lyon, http://www.ens-lyon.fr/LIP/lip/publis/publis.us.html, February 2001.
[7] V. Boudet, F. Rastello, and Y. Robert, A proposal for a heterogeneous cluster ScaLAPACK (dense linear solvers), Proceedings of PDPTA, CSREA Press, 1999.
[8] P. Boulet, J. Dongarra, F. Rastello, Y. Robert, and F. Vivien, Algorithmic issues on heterogeneous computing platforms, Parallel Processing Letters, 9, (1999), 197-213.
[9] J. Choi, J. Dongarra, L. Ostrouchov, A. Petitet, D. Walker, and R. Whaley, The design and implementation of the ScaLAPACK LU, QR and Cholesky factorization routines, Tech. Report LAPACK Working Note 80, ORNL, September 1994.
[10] J. Choi, J. Dongarra, and D. Walker, The design of a parallel dense linear algebra software library: Reduction to Hessenberg, tridiagonal and bidiagonal form, Tech. Report LAPACK Working Note 92, University of Tennessee, Knoxville, January 1995.
[11] D. Culler, R. Karp, D. Patterson, A. Sahay, K. Schauser, E. Santos, R. Subramonian, and T. von Eicken, LogP: Towards a realistic model of parallel computation, 4th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, 1993.
[12] J.W. Demmel, Applied Numerical Linear Algebra, SIAM, 1997.
[13] J. Dongarra and D. Walker, The design of linear algebra libraries for high performance computers, Tech. Report LAPACK Working Note 58, University of Tennessee, Knoxville, June 1993.
[14] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam, PVM 3 user's guide and reference manual, Tech. Report TM-12187, ORNL, September 1994.
[15] G. Golub, Matrix Computations, The Johns Hopkins University Press, 1996.
[16] D.L. Harrar and M.R. Osborne, Solving large-scale eigenvalue problems on vector parallel processors, VECPAR'98, vol. 1573, Springer-Verlag LNCS, 1998, 100-113.
[17] G. Henry, D. Watkins, and J. Dongarra, A parallel implementation of the nonsymmetric QR algorithm for distributed memory architectures, Tech. Report CS-97-352 and LAPACK Working Note 121, University of Tennessee, March 1997.
[18] R. Hockney and C. Jesshope, Parallel Computers 2, Adam Hilger, Ltd, Bristol, UK, 1992.
[19] J. JaJa and K. Ryu, The block distributed memory model, Tech. Report CS-TR-3207, University of Maryland, January 1994.
[20] Y.-K. Kwok, Parallel program execution on a heterogeneous PC cluster using task duplication, Proceedings of 9th Heterogeneous Computing Workshop, IEEE CS Press, May 2000, 364-374.
[21] Y.-K. Kwok and I. Ahmad, Dynamic critical path scheduling: An effective technique for allocating task graphs to multiprocessors, IEEE Transactions on Parallel and Distributed Systems, 7, (1996), 506-521.
[22] M.J. Litzkow, M. Livny, and M.W. Mutka, Condor - a hunter of idle workstations, Proc. IEEE 8th International Conference on Distributed Computing Systems, IEEE CS Press, 1988, 104-111.
[23] J.E. Marsden and A.J. Tromba, Vector Calculus, W.H. Freeman and Company, 1981.
[24] W. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery, Numerical Recipes in C: The Art of Scientific Computing, Cambridge University Press, 1997.
[25] E. Rosti, E. Smirni, L. Dowdy, G. Serazzi, and K. Sevcik, Processor saving scheduling policies for multiprocessor systems, IEEE Transactions on Computers, 47, (1998).
[26] S. Sahni and V. Thanvantri, Performance metrics: Keeping the focus on runtime, IEEE Parallel & Distributed Technology, (1996), 43-56.
[27] B. Shirazi, M. Wang, and G. Pathak, Analysis and evaluation of heuristic methods for static task scheduling, Journal of Parallel and Distributed Computing, 10, (1990), 222-232.
[28] C. Spurgeon, Ethernet Configuration Guidelines, Peer-to-Peer Communications, Inc, 1996.
[29] A. Steen, Scaling and computational similarity, Tutorial in VECPAR'98, 3rd International Meeting on Vector and Parallel Processing (Systems and Applications), Porto, 1998.
[30] H. Topcuoglu, S. Hariri, and M.-Y. Wu, Performance-effective and low-complexity task scheduling for heterogeneous computing, IEEE Transactions on Parallel and Distributed Systems, 13, (2002), 260-274.
[31] L.G. Valiant, A bridging model for parallel computation, Communications of the ACM, 33, (1990), 103-111.
[32] J. Watts and S. Taylor, A practical approach to dynamic load balancing, IEEE Transactions on Parallel and Distributed Systems, 9, (1998), 235-248.