
IEEE TRANSACTIONS ON COMPUTERS, VOL. 48, NO. 1, JANUARY 1999

A Gracefully Degrading Massively Parallel System Using the BSP Model, and Its Evaluation

Andreas Savva, Member, IEEE, and Takashi Nanya, Senior Member, IEEE

• A. Savva is with Fujitsu Ltd., Kawasaki 211-8588, Japan. E-mail: [email protected].
• T. Nanya is with the Research Center for Advanced Science and Technology, University of Tokyo, Tokyo 153-8904, Japan. E-mail: [email protected].

Manuscript received 12 June 1996; revised 10 Feb. 1998. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 102032.

Abstract—The Bulk-Synchronous Parallel (BSP) Model was proposed as a unifying model for parallel computation. By using Randomized Shared Memory (RSM), the model offers an asymptotically optimal emulation of the Parallel Random Access Machine (PRAM). By using the BSP model with RSM, we construct a gracefully degrading massively parallel system using a fault tolerance (FT) scheme that relies on memory duplication to ensure global memory integrity and to speed up the reconfiguration. After a fault occurs, global reconfiguration restores the logical properties of the system. Work done during reconfiguration is shared equally among the live processors, with minimal coordination. We analyze, at the level of the BSP model, how the performance of a system may change as processors fail and the performance of the interconnection network degrades. We relate the change in overall system performance to the change in computation and communication load on the live processors. Further, we show how to estimate the overhead imposed by the FT scheme. We evaluate the reconfiguration time, the overhead, and graceful degradation of the system experimentally by an implementation on a Massively Parallel Processor (MPP). We show that the predictions about the degradation of the system and the overhead cost of the scheme are accurate.

Index Terms—BSP model, graceful degradation, fault tolerance, memory duplication, MPP, PRAM, RSM.

1 INTRODUCTION

The computational power required to attack problems defined in the Grand Challenges [1] does not yet exist. In the quest for TERAFLOPS processing power and beyond, generally agreed to be necessary for problems like climate or semiconductor modeling, the most promising approach is the Massively Parallel Processor (MPP).

A key problem with MPPs, however, is reliability. The huge number of components that make them up, even if they have an extremely long Mean Time Between Failure (MTBF), means that the MTBF of the entire system will be low, perhaps counted in terms of hours.

The sheer redundancy an MPP exhibits, however, should make it possible to incorporate fault tolerance mechanisms that ensure continuing operation, albeit at lower performance. Hardware fault tolerance methods alone, though essential, are not sufficient to guarantee a gracefully degrading system. Therefore, there has been a lot of attention given to software methods. Two main approaches can be discerned. One is using general methods that work independently of the program. Examples of this approach range from the architecture specific, e.g., using virtual hypercubes [2], to using programming models such as functional languages [3], to more general techniques, for example, rollback recovery [4]. The second approach is to include some mechanism tailored to the program itself. Examples are algorithm-based fault tolerance [5] or application-specific self-stabilization [6].

General methods have the advantage of being applicable to any program, without additional work. But they may have higher overheads in fault-free situations since they are not tailored to the application. In both kinds of approaches, general or algorithm-specific, the model used is either machine dependent, e.g., the interconnection network, or too high level, e.g., a programming language or environment. Therefore, it is not feasible to evaluate the behavior of an approach without implementing it on each machine. For example, if the model of the system is the interconnection network, many low level features need to be taken into account and the task of evaluating the resulting system is too complex [2]. If, on the other hand, a high level language is used, there is not enough information about the cost of operations to predict the behavior of the system.

We present an approach to graceful degradation for massively parallel processing systems that adopts the Bulk-Synchronous Parallel (BSP) model, a high-level, machine-independent, programming-language-neutral model that is also a cost model. Building a fault-tolerant system using such a model simplifies evaluation. Machine specific features are represented in a high level manner, while program costs, such as communication and computation costs, can be represented precisely. Hence, the impact of the fault tolerance approach on a program can be gauged more accurately.

The BSP model was shown to offer an optimal emulation of the Parallel Random Access Machine (PRAM) model. We assume a PRAM language model on top of the BSP model and use the key features of the emulation, such as the global memory randomization, to include sufficient redundancy to allow the system to recover when a single processor fails. Once a processor fails, reconfiguration is carried out. After reconfiguration is done, the system is able to withstand another processor failure. Provided there is enough combined memory in the live processors to keep the contents of the global memory, the system can still function even after p - 1 processor failures and reconfigurations have occurred, where p is the initial number of processors.

Further, an evaluation method for fault tolerance schemes defined on the BSP model is proposed. As the BSP model is a cost model, we can determine the best we can do in terms of a gracefully degrading system and under what conditions we can achieve it. We relate the degradation that an "ideal" system is capable of to the maximum change in the computation or communication load that any node experiences. This way of looking at the degradation is similar to the idea of normalized load used to predict the behavior of hypercubes under multiple faults [2]. The problems faced are different, however. In [2], structural properties of the program, like the data distribution, optimized specifically for a hypercube, and machine specific properties, like message routing, have to be taken into account explicitly. Instead, we can focus on global properties of the network, such as putting a bound on how much we can allow the network to degrade if we are to ensure a balanced system.

Even though the approach we present is not tailored to specific algorithms or machines, the overhead introduced during normal operation is not fixed but will depend on the actual algorithm being executed. The maximum overhead that may become visible can be identified for each algorithm through our evaluation method and can be controlled, provided the granularity of the algorithm can be changed. We experimentally evaluate both the overhead and the degradation behavior of our fault-tolerant system under a number of faults by an implementation on an MPP.

Finally, note that it may be possible to use the approach described in this paper with other parallel cost models, such as the LogP [7] model. The LogP model is similar to the BSP model but models communication costs in more detail.

The rest of the paper is organized as follows: The next section introduces the BSP model and the key features of the PRAM emulation. Section 3 discusses the fault model assumed and describes the FT scheme, in particular the way memory is distributed in the system and the way system reconfiguration is carried out. Section 4 discusses the evaluation of gracefully degrading systems defined on the BSP model and then applies the evaluation to the FT scheme presented in this paper. Section 5 presents experimental results from an implementation on an MPP. Section 6 is the conclusion.

2 THE BULK-SYNCHRONOUS PARALLEL MODEL

The BSP model was introduced by Valiant [8] as a way of modeling parallel machines and developing portable parallel applications. It was envisaged as a model that could be efficiently mapped to a specific parallel machine, while parallel programming languages could be mapped efficiently to it. It is described as a combination of three attributes: a number of processor or memory nodes, a router that delivers messages point to point among the nodes, and facilities to synchronize all, or a subset of, the nodes at regular intervals of time. In this paper, we consider only the case of a node containing both processor and memory, and we use the terms processor and node interchangeably.

The parameters that characterize a BSP machine are the following:

p  The number of nodes.

L  The minimal time between successive synchronizations, or the synchronization rate of the machine. Also, L reflects the granularity at which computation can be efficiently carried out on a certain machine.

g  The ratio between the total number of local operations performed by all the processors per second and the total number of words delivered by the communication network per second. It is a measure of how efficient the machine is in communication. It determines the rate of communication permitted in the program, if the program is to be executed efficiently.

In addition, the BSP model, to ensure optimal operation, requires the program to have a certain amount of excess parallelism. For example, if there are p physical processors in the system, then the program should have been written as if at least p log p processors existed. This extra parallelism is used to hide the communication latency.

Computation in the BSP model proceeds in a sequence of supersteps, with all nodes synchronizing, barrier style, at the end of each superstep. A superstep is made up of either communication or computation operations, or both, and has a minimum duration, denoted by parameter L. Computation is carried out on data held locally before the start of the superstep, i.e., the computation part of a superstep is independent of the communication part of the same superstep. Communication operations, such as read and write, started in a superstep will complete by its end. Hence, data requested or updated during a superstep will only be available at the beginning of the next superstep. The bottom part of Fig. 1 shows this activity.

It is assumed in this paper that read and write operations are implemented using message primitives, like send and recv, and not direct memory access primitives, such as put and get. Both read and write operations are made up of two messages—a request message and a reply message. Throughout this paper, a message refers to this kind of message: a read or write request or reply message.

Messages making up communication operations such as read and write are sent in bulk in what are called h-relations. Briefly, an h-relation is a routing problem where each processor sends at most h messages and receives at most h messages. Such an h-relation, under the BSP model, is assigned a cost gh, where h is the maximum number of messages any single node has to send or receive and g is the computation to communication ratio of the machine. If the maximum computation cost any node is assigned to carry out in superstep S is w_S, and if h_s and h_r stand for the maximum number of messages sent or received by any node, then the total cost of S, C_S, is defined as:

DEFINITION 2.1 (Superstep Cost).

    C_S = max{L, w_S, g·h_s, g·h_r}.

The cost of an algorithm is the sum of the costs of all its supersteps.
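To make Definition 2.1 concrete, the following small C sketch computes the cost of one superstep from the BSP parameters; the function and argument names are illustrative and not part of the original formulation.

    /* Sketch: cost of one superstep under Definition 2.1.
     * L and g are the machine parameters; w is the maximum computation
     * assigned to any node; hs and hr are the maximum numbers of messages
     * sent and received by any node.  All names are illustrative. */
    static double superstep_cost(double L, double g, double w, double hs, double hr)
    {
        double c = L;                 /* a superstep lasts at least L      */
        if (w > c)      c = w;        /* computation may dominate          */
        if (g * hs > c) c = g * hs;   /* ... or the messages sent          */
        if (g * hr > c) c = g * hr;   /* ... or the messages received      */
        return c;                     /* C_S = max{L, w_S, g*h_s, g*h_r}   */
    }

Summing superstep_cost() over all supersteps then gives the cost of the whole algorithm.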

The above cost definition is appropriate when communication and computation are overlapped in a superstep and synchronization is carried out periodically, as in the original BSP definition [8]. Note that other superstep cost definitions [9], [10] are possible but do not give significantly different results than the one above.

Two distinct modes of operation of the BSP model were envisaged: the automatic and the direct mode. In the automatic mode, the global memory distribution is automatically managed through randomization or hashing at the memory word level. That is, each word of the global memory is assigned to a processor using a hash function. In this paper, we call this kind of memory Randomized Shared Memory (RSM).

In the direct mode, on the other hand, the programmer is responsible for the data distribution [9] and, therefore, for ensuring that accesses are evenly distributed. The direct mode is more appropriate when g is not low enough or when even small constants in the emulation need to be eliminated.

In this paper, we will assume the automatic mode of operation throughout. The reason for using hashing in the automatic mode is as follows: From the cost equation above, the communication cost of a superstep depends on the maximum number of messages any node sends or receives. Therefore, it is necessary to have a guarantee that, for an arbitrary program, an approximately even distribution of memory accesses can be achieved. The most promising general technique for achieving an even distribution of memory accesses is memory randomization, or hashing.

In [11], it was shown that there do exist universal hashing functions that, with high probability, guarantee the property of approximately even distribution, but they require O(log p) computation steps and are not one-to-one. In this paper, for simplicity, we will use linear hash functions of the form h(A) = (bA + c) mod M, where h(A) is the hashed address, A the logical address, M the global memory size, and c and b some constants. An extra requirement is that b and M must be relatively prime. The hashed memory address, h(A), may be assigned to some node by using the function h(A) mod p, where p is the number of nodes in the system. Such a hash function does not, however, guarantee an even distribution of memory accesses. Nonetheless, it is easier to calculate and probably sufficient in most cases.
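As an illustration of this hashing scheme, a minimal C sketch follows; the constants b and c are assumed to have been chosen elsewhere (with b and M relatively prime), and the function names are ours.

    /* Sketch: linear hashing of a logical address and its node assignment.
     * h(A) = (b*A + c) mod M; the hashed address is then placed on node
     * h(A) mod p.  Names are illustrative only. */
    static unsigned long hash_addr(unsigned long A, unsigned long b,
                                   unsigned long c, unsigned long M)
    {
        return (b * A + c) % M;       /* hashed address                      */
    }

    static unsigned long home_node(unsigned long hashed, unsigned long p)
    {
        return hashed % p;            /* node that stores the hashed address */
    }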

An alternative linear function that also has the advantage of being two-universal is m(A) = A/(M/p) [12], where m(A) is the node in which A will be stored. As before, A is the logical address, M the global memory size, and p the number of nodes in the system. There is a restriction that M must be a multiple of p. Since the number of nodes, p, changes when nodes fail, this restriction makes it difficult to use this function, as it is difficult to maintain the relation between M and p.

Fig. 1. A superstep.

We might still be able to use the two-universal function in the following way: The BSP model requires that a program that will be run on p processors is written as if it will run on v processors, where v ≥ p log p. Since v will not change, even when nodes fail, we can use instead m(A) = A/(M/v) for the hash function.

In any case, note that the approach described in Section 3 does not depend on a specific class of hash functions. It uses the hashed addresses but is not tied to the way the hashing is computed. Therefore, any hash function that does not put limitations on the values of p should be usable.

One important result for the automatic mode is that it can offer an asymptotic emulation of the PRAM, provided that g is sufficiently low, O(1) [8]. For example, each PRAM processor can be thought of as a thread, or task, V_i. Provided the program is manipulated so that all data required for a superstep is prefetched, a possible view of the emulation is shown in Fig. 1. Each virtual processor V_i is mapped to some physical processor P_j. Each P_j holds a number of V threads, at least log p of them, where p is the number of physical processors. At the beginning of each superstep, each P_j sends all the read requests for data that its V threads will use in the next superstep. After all the requests are sent, P_j carries out the computation of each V thread. All data used in this computation are already available locally. During computation, write operations are carried out, but they are not guaranteed to complete until the end of the superstep. When all the computation is done, and all the write and read operations of all processors are satisfied, the barrier is complete and the next superstep begins.
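The emulation step just described can be summarized by the following C-style sketch of one superstep on a physical processor; send_read_requests(), run_thread(), and bsp_barrier() are placeholders for whatever message and synchronization primitives the target machine provides, not an actual API.

    /* Placeholder primitives assumed to be provided by the runtime. */
    void send_read_requests(int thread_id);
    void run_thread(int thread_id);
    void bsp_barrier(void);

    /* Sketch of one superstep of the PRAM emulation on a physical processor
     * holding nthreads virtual processors (V threads). */
    void run_superstep(const int *thread_ids, int nthreads)
    {
        int i;
        /* 1. Issue the read requests for data the threads will use in the
         *    next superstep (prefetching). */
        for (i = 0; i < nthreads; i++)
            send_read_requests(thread_ids[i]);
        /* 2. Compute on data already held locally; writes issued here are
         *    only guaranteed to be visible after the barrier. */
        for (i = 0; i < nthreads; i++)
            run_thread(thread_ids[i]);
        /* 3. Barrier: all read and write operations of all processors have
         *    completed when the barrier is passed. */
        bsp_barrier();
    }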

It may not always be possible, due to data dependencies, to prefetch data in this manner. In that case, we might need to either have supersteps that consist only of communication operations or use a different synchronization mechanism instead of the barrier. In the original BSP paper, other methods of synchronization were not precluded. This point is discussed further in Section 5.

3 THE FAULT TOLERANCE SCHEME

We discuss the fault model assumptions, describe the system's memory distribution, and show how system reconfiguration can be accomplished after a node fails.

3.1 Fault-Model Assumptions

We make the following assumptions:

1) The system is made up of fail-stop processors [13].
2) At any point, only a single processor may fail and, until the system is reconfigured, there will be no other fault occurring, i.e., the single-fault model.
3) Processor failures have no effect on the interconnection network or, more precisely, the value g of the machine does not change because of the faults.
4) There is enough memory in each node to keep the part of the global memory assigned to it, see Section 3.2, even after a number of faults occur.
5) The unit of computation is the task and a task does not contain any data that needs to be preserved. All important data is in the global memory.
6) For simplicity, we consider only processors failing and leaving the computation. We do not consider the case of processors reentering the computation.

Following is a discussion and justification of the assumptions.

Fail-stop nodes may be implemented, for example, through concurrent error checking or self-checking techniques [14].

For Assumption 2 to be reasonable, the reconfiguration time has to be kept as short as possible. We will show experimental results later that indicate that this time is likely to be relatively short, see Section 5.1.

Assumption 3 is unlikely to be true in practice and is the strongest assumption we make here. More reasonable would be that the network degrades by some, possibly constant, factor after each fault. Note, however, that, in general, the network connecting the processors is built independently from the processors themselves [15]. Each processor connects to a router and issues requests to the router. The network consists of the interconnections between the routers. If the processor fails, the router will be unaffected. As the router itself has more limited functions than the processor, it can be built simpler and, hence, is less likely to fail. Viewing the system in this way, however, we increase the number of components and, hence, decrease the MTBF of the entire system! In any case, this is a problem that needs to be addressed further. We note that there have been a number of efforts toward gracefully degrading networks [16], [17]. The network degradation problem, in the context of the BSP model, is addressed in Section 4.3.3.

The memory distribution scheme, as discussed in Section 3.2, requires memory duplication. Such a cost may be considered too high. First, note that MPP nodes today come with a lot of main memory. Second, in some cases, nodes also have access to hard disk drives, perhaps through a locally attached disk. A simple improvement to our approach that does not require a lot of modification would simply have each node keep the initial secondary mapping assigned to it on disk. For performance reasons, updates to the secondary mapping would still be kept in main memory and, perhaps, written out to disk at an appropriate moment. Third, it is possible to minimize global memory duplication by using techniques proposed in [18].

Finally, note that it is possible to change the FT scheme to allow failed processors that have recovered to reenter the computation at some later point. For example, a simple approach would be to make processors that have recovered wait until some other processor fails and then take over its place.

3.2 Memory Distribution

The logical memory is hashed using a linear hash function, as described in Section 2. We then distribute the hashed addresses to the memory of each node using mod p, with processor 0 holding hashed addresses H_0, H_p, H_{2p}, …, processor i holding hashed addresses H_i, H_{p+i}, H_{2p+i}, …, and so on, see Fig. 2. Since we operate only on the hashed addresses, the method of calculating them does not affect this scheme, as noted before in Section 2. We will call this distribution the primary memory distribution.

To deal with a fault, we replicate each location of the primary memory and store it on a different processor. We call this distribution the secondary memory distribution. In the event of a fault, the entire primary mapping will change, as it is computed using mod p. Therefore, the way of distributing the secondary mapping is motivated by the desire to make the reconstitution of the primary mapping faster. There is nothing we can do to speed up the reconstitution of the backup mapping.

We use mod (p - 1) to calculate the distribution of the secondary mapping and divide it among processors with ids 0, …, p - 2. Certain locations will map, in both the primary and the secondary mapping, to the same processor. Such locations are, for example, hashed addresses H_0, …, H_{p-2}, H_{(p-1)p}, …, H_{(p-1)p+(p-2)}, and so on, see Fig. 3. These locations are treated specially and are assigned to the processor with id p - 1 instead, as shown in Fig. 3. This distribution still leads to each processor having approximately the same number of memory locations assigned to it in both the primary and the secondary mapping.
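The placement rules just described can be written down directly. The following C sketch, with naming of our own, returns the node holding a hashed address H under each mapping, including the special case in which both rules would pick the same node.

    /* Sketch: owners of hashed address H under the primary and secondary
     * mappings, for p initial processors with ids 0..p-1.  Illustrative only. */
    static int primary_owner(unsigned long H, int p)
    {
        return (int)(H % p);
    }

    static int secondary_owner(unsigned long H, int p)
    {
        int sec = (int)(H % (p - 1));     /* candidate among ids 0..p-2       */
        if (sec == (int)(H % p))          /* would coincide with the primary  */
            return p - 1;                 /* so the copy goes to node p-1     */
        return sec;
    }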

Other secondary mappings with similar behavior do exist, and they may be preferable in certain cases. In particular, if we also plan to use memory replication to speed up memory accesses, we can use a different mapping, described in [11], as our secondary mapping. This mapping, for our purposes, has similar behavior and similar communication costs to the one we use here. It is discussed in [19]. The distribution described above is preferred because the primary mapping can be regenerated faster.

During computation, the secondary mapping is kept consistent with the primary one by generating an extra write operation for each normal write operation. The overhead of this extra write operation is evaluated in Section 4.4.1.

3.3 Reconfiguration After a Fault

There are three parts to reconfiguring the system after a node fails. First, we reconfigure the primary memory mapping, then we reconfigure the secondary mapping, and, finally, we reassign tasks to processors to get as even a computational load as possible. A detailed description of the reconfiguration algorithm follows.

First, the reconfiguration of the primary mapping. There are two distinct cases here, depending on which processor has failed. In the case that the processor with id p - 1 fails, the other processors already have the correct locations of the new primary mapping in their secondary mapping. All they need to do is substitute the secondary mapping for the primary one.

It is more likely, though, that some processor with an id between 0 and p - 2 fails, e.g., processor f. In this case, processor p - 1 will assume id f, relinquishing its own. Then, it will wait to receive the memory locations it is missing from its new primary mapping from the other processors.

All the other processors first reestablish their new primary mapping by simply swapping their primary mapping with their secondary mapping, as described above. Then, every processor, including the processor assuming id f, scans its old primary and secondary mapping and picks out the locations that belong to the current processor with id f. Then, each processor sends a message to the processor currently holding id f with the locations that belong to it.

The processor with id f will merge the locations received together with the ones it has picked up from its own old primary and secondary mapping to build up its new primary mapping. Once it is done, all the processors have a valid primary mapping.

After the primary mapping is reconstructed, the secondary mapping reconstruction proceeds as follows: Each processor scans its primary mapping and divides it into p pieces, where p is the number of currently active processors. Each processor then sends one message to every other processor. Each processor receives p - 1 messages. The secondary mapping is reconstructed when each processor has merged all the locations it has received.

Finally, task reconfiguration is trivial since each processor can independently decide the task ids it currently holds. All tasks will be restarted from the end of the last successful superstep. Computation reorganization is discussed in more detail in the next section.

Fig. 2. Primary memory mapping.


In both reestablishing the new primary mapping and distributing the backup, each processor can decide independently what actions to take, provided the id of the failed processor is known. Each processor can calculate which of its locations need to be sent to which processor and how many locations it needs to receive for both primary and secondary mapping reconstitution. There is, therefore, no need for any extra coordination, and the processing is shared approximately equally among all the processors.
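To make the coordination-free nature of the reconfiguration concrete, the following simplified C sketch shows the only test a live node needs in order to decide whether a hashed address it holds must be forwarded to the node that has assumed id f; message packing and the actual communication primitives are omitted, and the naming is ours.

    /* Sketch: after processor f fails, the p_new = p - 1 live nodes rebuild
     * the primary mapping.  Each node scans the hashed addresses it holds in
     * its old primary and secondary mappings and forwards those whose new
     * primary owner is id f.  Illustrative only. */
    static int must_send_to_f(unsigned long H, int p_new, int f)
    {
        return (int)(H % p_new) == f;     /* new primary owner is id f */
    }

Because every node can evaluate this test locally, knowing only the id of the failed processor, no further coordination is required.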

3.4 Computation Reorganization and Fault Latency

In the computation model we are considering, each task is distinguished only by its id. The tasks are synchronized every superstep. All important data is in the global memory, with the changes to the global memory not being committed until the end of a superstep. If the current superstep fails to be completed, then these changes are discarded and execution rolls back to the last correctly completed superstep. Because of the global barrier, there is always a correct point to restart the computation after a fault and no problem like the Domino effect can occur. The way the barrier is used above is similar to ideas described in [20].

We distribute the tasks to the nodes by using, as in the case of the memory locations, mod p to decide each task's new location. That is, if there are 64 nodes, node 0 will receive tasks 0, 64, 128, …, node 1 will receive tasks with ids 1, 65, 129, …, and so on. After a fault occurs, the tasks are redistributed by reassigning task ids to processors. Processors can reuse existing tasks or create new tasks as needed.

As in the case of the memory reconfiguration, given the id of the failed node, each live node can decide which task ids it holds. After a processor fails, each remaining processor will receive at most one extra task id. Therefore, the time to reconfigure the computation is negligibly short and is dwarfed by the memory reconfiguration time. We do not consider it further here.

The above distribution ensures that the tasks will be evenly distributed among the live processors. The load is also likely to be distributed about evenly since, as in PRAM algorithms, processors execute the same program, though not necessarily the same program statements.

Fig. 3. Primary and secondary mapping.

At this point, it is worth looking at how a fault might be detected in this model. We will argue here that the fault latency time is likely to be short due to the structure of the computation. If, as in the original proposal of the BSP model [8], we assume that we check every L time steps whether the computation has finished or not, we can use the same check to make sure that all the processors are still alive. This check is of the same complexity. Hence, the maximum fault latency will be at most L time units, or a single superstep. Alternatively, if we consider a superstep to end at any time after L time units [9], we can still set some maximum period after which there is a check to determine whether all the processors are still alive or not. This maximum period will be O(L) and probably quite short, both in terms of absolute time and in terms of the computation that may be lost.

4 GRACEFUL DEGRADATION AND OVERHEAD OF FAULT TOLERANCE SCHEMES

First, we discuss, in general, the behavior of fault tolerance schemes on the BSP model and define the concepts of Overhead and Degradation in the context of the BSP model. We then apply these ideas to predict the behavior of the scheme presented in the previous section.

4.1 Evaluation Assumptions

A fault-tolerant system can be thought of as consisting of a base non-fault-tolerant part and a fault tolerance (FT) scheme. The FT scheme consists of operations that need to be carried out in the normal course of events and operations that are carried out when the system experiences a fault.

We call the operations that are carried out during the normal course of events the Overhead Cost, Θ, of the scheme. The operations that are carried out after a fault occurs we call the recovery and reconfiguration operations. Here, we are not concerned with the recovery operations but only with the overhead operations. Typically, the overhead cost should also include the cost of fault diagnosis. As the operation of the BSP model requires a synchronization barrier at the end of each superstep, we assume that fault detection occurs at the barrier and, therefore, that the diagnosis cost is part of the barrier cost [21]. Hence, we do not consider it as an overhead cost.

Once the system has experienced a fault and has been reconfigured by applying the recovery procedure, a new system is constructed that will perform at some level lower than before. Relating the performance of the reconfigured system to the performance of the original fault-free system, for some algorithm running on the system, gives the graceful degradation of the scheme.

The degradation can be characterized by changes in the parameters of the BSP model, i.e., p, g, and L. We will analyze the cases when p and g change but, for simplicity, ignore possible changes in the value of L. In general, the rate of synchronization, L, will also be affected by network faults if synchronization is carried out by message passing. In this case, changes in L may be treated in an analogous manner to changes in the communication cost. In certain cases, where a machine has a dedicated synchronization network, e.g., the Fujitsu AP1000 [15], the synchronization cost will not be affected by faults in the normal communication network.

For simplicity, we assume throughout the following analysis that all supersteps of an algorithm have to carry out the same total amount of computation and communication. Therefore, the superstep cost, C_S, of every superstep is the same, provided no faults occur.

Depending on the analysis carried out, some further assumptions are made. They will be introduced when appropriate.

4.2 The Overhead Cost

Let the original cost of a superstep S be C_S. The FT scheme adds some cost, increasing the superstep cost to C'_S. The overhead cost, Θ, can then be defined as:

DEFINITION 4.1 (Overhead Cost).

    Θ = C'_S / C_S.

A consequence of the definition of the superstep cost, Definition 2.1, is that an FT scheme may add operations to a system and still cause no visible overhead. For example, it may be that, for a certain algorithm, computation costs predominate. Hence, if we can tailor an FT scheme that adds only communication costs and still leaves the computation cost the dominant factor, then we can be certain that no overhead will be visible. Similarly, for the case of algorithms that are communication bound, we can tailor a scheme that adds only to the computation cost.

In fact, the worst case for a scheme under the superstep cost assumed is when an algorithm has computation and communication costs balanced. Then, any added operations will incur some visible overhead. In this case, there is at least freedom to add both communication and computation operations, as long as we make sure that the overhead is not excessive.
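As a small illustration, with numbers assumed purely for exposition: if a superstep has w_S = 1,000 time units and g·h_s = g·h_r = 600, an FT scheme may add up to 400 units of communication before the superstep cost max{L, w_S, g·h_s, g·h_r} changes at all; if instead w_S = g·h_s = g·h_r = 600, any added operation raises the maximum and becomes visible as overhead.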

In Section 4.4.1, we analyze the overhead of the previously described FT scheme.

4.3 Graceful Degradation

If the cost of a superstep S on a machine with no faults is C'_S(0), and its cost on a machine that has suffered f faults is C'_S(f), then the Degradation after f faults, D_f, is defined as:

DEFINITION 4.2 (Degradation).

    D_f = C'_S(f) / C'_S(0).

The total number of operations, computation, and message transfers that a BSP model algorithm has to execute during the superstep does not change. The resources of the machine, however, have decreased, causing the visible superstep cost to increase.

We can characterize the resource degradation in two ways: processor faults, i.e., parameter p decreases, and faults that cause the network to degrade, i.e., parameter g increases. We therefore subdivide the analysis of the degradation by looking at the impact of changes in p and g on the system performance.

The case of processor failures is further split into two categories. One category covers coarse grain computation and the other covers fine grain computation. In both cases, we assume that the computation cost of the algorithm is the dominant cost of the superstep and it always remains so, regardless of processor failures.


4.3.1 Coarse Grain Computation

The number of tasks, or virtual processors, v, is fixed at the beginning. For simplicity, assume that all tasks in superstep S have the same cost, t_w. Initially, all v tasks, where v ≥ p log p, are allocated approximately evenly to the p processors, using Brent's scheduling principle [22]. That is, the maximum number of tasks that any processor has is ⌈v/p⌉, and the minimum number any other processor has is ⌊v/p⌋. Then, the maximum computational load, w_S, of any node in a superstep is given by

    w_S = Σ_{i=1}^{⌈v/p⌉} t_w = ⌈v/p⌉ × t_w.

After f processor failures and reconfigurations, the maximum load at any node will be

    w'_S = ⌈v/(p - f)⌉ × t_w.

The performance degradation of the machine is the change in the maximum load:

    w'_S / w_S = (⌈v/(p - f)⌉ × t_w) / (⌈v/p⌉ × t_w) = ⌈v/(p - f)⌉ / ⌈v/p⌉.    (1)

A plot of (1) is shown in Fig. 4. The performance degradation is not smooth but moves in a step-like manner. Large values of p imply smaller load imbalance after processors fail and, hence, less observable degradation. It also means that the degradation "steps" come less frequently.
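As a worked example with assumed values, take v = 1,024 tasks on p = 64 processors: initially ⌈v/p⌉ = 16 tasks per node, and after one failure ⌈v/(p - 1)⌉ = ⌈1,024/63⌉ = 17, so the predicted degradation is 17/16 ≈ 1.06; it stays at that level for the next two failures and only jumps to 18/16 at the fourth failure, which is exactly the step-like behavior plotted in Fig. 4.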

4.3.2 Fine Grain Computation

We can do better than (1) if we assume that we can divide the computation more finely. If a superstep S has to complete some total number of operations, W_S, we assume that we can divide them completely evenly among the entire set of processors, p. Each processor will receive w_x operations, where w_x = W_S/p. The superstep cost is then w_x. After f processor failures and reconfigurations, the maximum load at any node will be w_y = W_S/(p - f).

The performance degradation of the machine is the change in the maximum load:

    w_y / w_x = (W_S/(p - f)) / (W_S/p) = p / (p - f).    (2)

Equation (2) describes the best possible performance that we can attain on the reconfigured system. Note that the two cases we have considered, (1) and (2), are asymptotically the same; refer to Fig. 4.

4.3.3 Network Degradation

The communication cost of a BSP model superstep is made up of two components: the maximum number of messages, h, processed by any node, and the computation to communication ratio, g, of the machine. We will analyze the change in the communication cost of a superstep by looking at the way h and g change due to faults in the machine.

By using RSM, we can assume that messages are distributed approximately evenly among the p processors. In the following discussion, for simplicity, we will assume a completely even message distribution. That is, each processor handles exactly h messages and the total number of messages in the system is H, where H = ph.

We also assume machines with nodes and routers that are independent of each other. We then consider two types of faults: processor failures and router failures. Processor failures do not affect the operation of the router. Hence, network performance is unaffected. Router failures, in addition to impairing the network performance, also render the processor inaccessible.

In the case of only processors failing, we will assume that g does not change, even though g is defined as the ratio between the total number of local operations performed by all the processors per second and the total number of words delivered by the communication network per second. The reason is that g can also be thought of as the throughput of the router when in continuous operation [8]. Therefore, this value should not change unless some property of the network also changes. We will model the effect of router failures, however, as an increase in the parameter g.

Fig. 4. Ideal degradation for coarse and fine grain division of computation.

4.3.3.1 Case 1: Only processors fail (g is unchanged). The total number of messages, H, does not change, but the number of processors over which they are distributed does. Hence, each processor will now receive an increased number of messages, h'. By a similar argument to the one given for estimating the increase in computational load in the previous section, we determine that the increase in the number of messages a node must process is given by p/(p - f). Hence, regardless of which cost dominated the superstep cost before, the observable degradation will still be the same.

Note that, in the above, we are relying on properties of the RSM. Regardless of the reconfiguration and recovery algorithm, if RSM is used, then the memory accesses should still be evenly distributed among the nodes of the system. Therefore, the above result should be a general property of all FT schemes that rely on RSM.

4.3.3.2 Case 2: Router failures (g increases). Next, consider the case when, after f router failures, parameter g has degraded to some value g_f. Then, the communication cost will increase to g_f·h', where h', as before, represents the increased number of messages each node handles. Therefore, the degradation factor for the communication cost will be p/(p - f) due to the increase in the number of messages each node handles and g_f/g due to the change in g.

Now, consider an algorithm that has communication and computation costs balanced, i.e., originally gh ≈ w_S, where w_S is the computation cost of the superstep. The degradation due to the change in computation cost and the degradation due to the extra number of messages each node processes are the same, as we discussed before. Therefore, because of the degradation in g, the communication cost will overtake the computation cost.

The above observation leads to the idea that we should benchmark gracefully degrading networks so that we know not only their g in fault-free situations but also their values of g_f. Then, when considering mapping a certain algorithm to a certain system, we can take into account the expected number of faults and, hence, the range of g_f, and make sure that the computation cost of each superstep of the algorithm is not overtaken by a possibly faster increasing communication cost.

4.4 Analysis of the FT Scheme

Here, we analyze briefly the overhead and graceful degradation of the FT scheme presented in Section 3, using the ideas developed in the previous subsections.

4.4.1 Overhead of the FT Scheme

The overhead that the scheme introduces is one extra write operation for each write operation carried out by the algorithm. Obviously, this cost will depend on the algorithm that is executed. From the superstep cost equation, the cost that will change relates to the maximum number of messages sent or received, i.e., g·h_s or g·h_r. In this paper, both a read and a write operation are implemented by two messages, a request and a reply message.

Assume that we have a completely even distribution of messages, perhaps achieved through a hashing scheme. That is, the number of messages each node sends or receives is the same, g·h_s = g·h_r. Then, if each node sends h_sw messages to carry out write operations and h_sr messages to carry out read operations, we can determine the overhead cost, Θ, as

    Θ = (h_sr + 2 × h_sw) / (h_sr + h_sw).    (3)

In practice, of course, it is unlikely that a completely even distribution of messages will be possible. Nonetheless, (3) is likely to be an upper bound of the overhead. For example, if each node p_i sends β log p messages, the maximum number of messages any node p_j will receive, with high probability, is γ log p, where β, γ are some constants and β < γ. Now, if the number of messages each node sends increases to β′ log p, where β′ > β, then the maximum number of messages any node receives will be γ′ log p, where γ′ > γ. It is likely, by the law of large numbers, that γ′/γ ≈ β′/β. Hence, we can take the ratio β′/β as an upper bound for the increase in messages any processor handles.

For example, the matrix algorithm [12], see Algorithm 5.1, that we will use as an example in Section 5.2 issues one write and two read operations per superstep. Therefore, the communication cost for the algorithm should increase by at most 33.3 percent per superstep.

In the case of the knapsack [23] algorithm, see Algorithm 5.2, each iteration consists of two supersteps, only one of which, the second one, issues a write operation, and no read operations. Therefore, the maximum observable overhead for the first superstep is 0 percent and for the second one 100 percent.

As noted earlier, this extra cost will not necessarily show in the overall superstep cost. In particular, if the computation cost, w_S, is big enough to cover this cost, then no overhead should be observable. Also notice that g, the ratio of computation to communication, does not matter here, as we are still talking about a system without any faults.

4.4.2 The Graceful Degradation Expected of the FT Scheme

The computation model that the scheme assumes is a fixed number of tasks that must be maintained on a decreasing number of processors. Hence, the analysis of Section 4.3.1 must hold, and the graceful degradation of the resulting system should be similar to that described in (1), see also Fig. 4. This prediction should be correct regardless of whether the computation cost or the communication cost of the algorithm dominates, since we assume no network degradation.

5 EXPERIMENTAL RESULTS

The approach described in the previous section was implemented on an MPP, the Fujitsu AP1000. The AP1000 is a distributed memory machine, with dedicated networks for intercell communication, broadcasting, and for executing global, barrier-style synchronization [15]. Global synchronization can be done very fast and is independent of the number of nodes that synchronize. We used C extended with a message passing library to implement the above approach.

Each node runs a number of tasks: one global memory manager, one synchronization/fault detection task, and a number of computation tasks. The global memory manager holds the primary and secondary mapping, as described previously. The synchronization/fault detection task carries out barrier synchronization. It also informs the other tasks when a fault occurs and arranges for starting and ending the reconfiguration. Each computation task implements a processor of a PRAM algorithm. At the beginning of the experiment, exactly log p tasks on each node will be active. Because the AP1000 cell OS does not support dynamic task creation, we have to create all the computation tasks we might need at initialization time, taking into account the number of faults that the run will investigate.

Access to the global memory is provided through two functions: gread() implements the global read; gwrite() implements the global write. Note that, even in the case when a global memory location is on the same node as the task that requests it, a message will be generated and the task will be descheduled.

Depending on the application, a copy of the input data might be distributed to the computation tasks or hashed to the global memory. All the other data is hashed to the global memory. When a node ends its computation for the current superstep, it requests global synchronization by making a gsynch() call. Hence, a superstep will end when all the nodes call gsynch(). We use this way of synchronizing because the AP1000 has a dedicated network for synchronization that can perform barrier-style synchronization very fast. Faults are assumed to be detected at this barrier and, once a fault is detected, system reconfiguration is initiated and the tasks are restarted at the beginning of the incomplete superstep.
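A simplified view of what the two access functions might look like follows. This is our sketch only: the hashing and owner functions are those sketched in Sections 2 and 3.2, the stored word is shown as a double purely for convenience, and send_request(), send_update(), deschedule_current_task(), and take_reply() stand in for the AP1000 message-passing and scheduling primitives, whose real names and signatures are not given here.

    /* Placeholder primitives assumed to be provided by the runtime. */
    void   send_request(int node, unsigned long hashed_addr);
    void   send_update(int node, unsigned long hashed_addr, double value);
    void   deschedule_current_task(void);
    double take_reply(unsigned long hashed_addr);

    extern unsigned long B, C, M;   /* assumed hash constants and memory size */
    extern int p_live;              /* current number of live nodes           */

    double gread(unsigned long addr)
    {
        unsigned long H = hash_addr(addr, B, C, M);   /* hashed address  */
        send_request(primary_owner(H, p_live), H);    /* request message */
        deschedule_current_task();                    /* block until the
                                                         reply arrives   */
        return take_reply(H);                         /* reply message   */
    }

    void gwrite(unsigned long addr, double value)
    {
        unsigned long H = hash_addr(addr, B, C, M);
        /* One update to the primary copy and one to the secondary copy; the
         * second message is the FT overhead analyzed in Section 4.4.1.
         * Reply handling is omitted here. */
        send_update(primary_owner(H, p_live),   H, value);
        send_update(secondary_owner(H, p_live), H, value);
    }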

The first experiment investigates the time for memory reconfiguration. Next, we present the algorithms to be simulated and results for the overhead of the approach described earlier, the graceful degradation achieved, as well as the efficiency of the simulation of the BSP model for the algorithms on the AP1000.

5.1 Reconfiguration Results

We measure only the memory reconfiguration time and, hence, do not implement any computation tasks. The computation reconfiguration is trivial and does not add any cost to the overall reconfiguration algorithm, see Section 3.4. Note that the memory reconfiguration time is independent of the computation being carried out. It depends only on the global memory size and the number of processors available.

A node, selected at random, is stopped. We measure the time it takes to reconfigure a new system with p - 1 nodes. The time it takes to reconfigure various global memory sizes on a 128-node AP1000 is shown in Fig. 5. From the graph, it can be seen that the time to reconfigure even reasonably large global memory sizes is relatively modest and counted in terms of seconds. Also, if the number of faults is small in relation to the total number of processors, the reconfiguration time does not change substantially.

Note that, for these results, the messages sent during reconfiguration are not optimized in any way specific to the AP1000 network topology. All nodes follow a simple sequential scheme of first sending the data needed by processor id 0, then processor id 1, and so on, up to the maximum id processor. The AP1000 network is physically a torus, but, since we use a linear processor mapping, such a way of sending messages leads to semi-random message traffic. It was felt that such a scheme is good enough since it does not overload any specific processor, in the spirit of the BSP model. It is not immediately clear that a more sophisticated scheme will produce better results.

5.2 Evaluation of Degradation

We analyze the behavior of the FT scheme through two quite different algorithms, a matrix multiplication algorithm [12] and a knapsack algorithm [23], to determine the accuracy of the previous predictions. The matrix multiplication is chosen as an example of an algorithm whose granularity can be tailored to fit a specific architecture. We can, therefore, consider it a best case algorithm. The knapsack is chosen as it is an optimal PRAM algorithm that requires very fine-grain communication, with a communication pattern that depends on the input data. It can be viewed as a worst case algorithm.

In taking the measurements presented here, we arranged for faults to occur at regular intervals, after a certain number of supersteps is completed. We define a period as the duration between two faults and calculate how long a period takes to complete. The period length is not the same for both algorithms, but it is consistently the same for measurements taken for each algorithm. The processor that fails is chosen at random. We evaluate the performance for a number of faults and for different system sizes.

5.2.1 The Matrix Multiplication

The matrix multiplication algorithm used follows:

Algorithm 5.1: The Matrix Multiplication Algorithm [12]

foreach processor (i, j) := (1, 1) to (k, m) do
    for r := 1 to l step 1 do
        t := (i + j + r) mod l;
        C[i, j] := C[i, j] + A[i, t] * B[t, j];
    end
end

where matrix A is k × l, matrix B is l × m, and matrix C is k × m.
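Under the conventions above, one iteration of loop r for the task emulating PRAM processor (i, j) could be structured as the following superstep body. This is a sketch, not the authors' code: addr_A(), addr_B(), and addr_C() are assumed helpers that map matrix elements to global (hashed) addresses, each element is treated as a single value rather than the 200-number tile used in the experiments, and C[i, j] is assumed to be accumulated locally so that each superstep issues exactly two global reads and one global write, matching the overhead analysis of Section 4.4.1.

    double gread(unsigned long addr);                 /* from the implementation */
    void   gwrite(unsigned long addr, double value);
    void   gsynch(void);
    unsigned long addr_A(int i, int t);               /* assumed address helpers */
    unsigned long addr_B(int t, int j);
    unsigned long addr_C(int i, int j);

    /* Sketch: one superstep (one iteration of loop r) of Algorithm 5.1 for
     * the task emulating PRAM processor (i, j).  Illustrative only. */
    void matmul_superstep(int i, int j, int r, int l, double *c_local)
    {
        int t = (i + j + r) % l;

        double a = gread(addr_A(i, t));    /* two global reads ...        */
        double b = gread(addr_B(t, j));

        *c_local += a * b;                 /* C[i, j] accumulated locally */
        gwrite(addr_C(i, j), *c_local);    /* ... and one global write    */

        gsynch();                          /* end of the superstep        */
    }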

Fig. 5. Memory reconfiguration (128 nodes).

We first present results of the matrix multiplication algorithm on a 64- and a 128-node AP1000. We obtained these results by setting each element of the matrix to be a tile of 200 floating point numbers. For collecting the results, we divide the computation into periods of 50 supersteps each. Referring to the algorithm, a period equals 50 iterations of loop r. During each period, the number of processors is fixed. Period 0 has all p processors alive, period 1 has p - 1 processors alive, and so on. Hence, the period number also represents the number of faulty processors.

The average CPU utilization for this system was approximately 16 percent. By increasing the number of operations done on the tile by some multiple of the original number of operations, we can get results that simulate the behavior of the matrix multiplication when the communication and computation costs are more balanced or the computation cost exceeds the communication cost. By a reverse argument, if the AP1000 was more efficient in communication, then the behavior of the original algorithm might be given by one of the manipulated cases.

The lines on the two graphs shown in Fig. 6 and Fig. 7represent the degradation of the system as the number offailed processors increases. The top-most line is the pre-dicted degradation as described in (1), Section 4.3.1. Theother lines show the algorithm at different computation tocommunication costs. The lowest line represents the resultsfor the original, unmodified, algorithm. The other lines arefor 10, 30, and 50 times the original computation cost.

The line representing the predicted degradation is la-beled an Upper Bound in these graphs for the following rea-son. Equation (1) gives an ideal view of the behavior of aload balanced system that has a dominant computationcost. In a real system, however, inefficiencies such as idletime due to communication latency may hide some of theincreased computation cost; some of the time taken by extracomputation is deducted from the idle time and does notshow up completely in the total computation time. There-fore, the upper line will define an upper bound on the visi-ble degradation.

The experimental data confirms this observation. As thecomputation cost of the superstep increases, the degrada-tion behavior follows more precisely the predicted behav-ior. When the computation cost is small, the communicationcost is not covered completely, hence, there is some idletime due to latency. As the system degrades and processorstake on more and more tasks, this idle time decreases, andcauses the observable degradation to be smaller than theone predicted.

This point is better understood by looking at the breakdown of the different costs for the matrix multiplication algorithm at different computation-to-communication levels: see Table 1 for the breakdown of the algorithm with the normal number of operations and Table 2 for the results of the algorithm with 50 times the normal number of computation operations per tile. Recall that the period number also represents the number of failed processors.

In Table 1, faults have an effect on the idle/context switching time, while, in Table 2, the effect is on the percentage of computation time. Notice that the increase in the percentage of synchronization time should be expected and is a result of having some nodes with a higher load than others after a fault and reconfiguration takes place. As more faults occur, more processors have the upper bound of the number of allocated tasks, refer to (1), and more of them arrive at the barrier at approximately the same time. Hence, by the entry for period 9, the synchronization cost has returned to the same value as initially.

The second trend that can be observed is that, as the computation cost increases, the performance of the algorithm becomes more stable. This effect is most noticeable in Fig. 7. For the original algorithm, with no extra operations added, the degradation does not always increase. This behavior is due to the hash function not distributing the memory accesses as perfectly as we assumed. For certain values of p, the message distribution is better; hence, the communication cost across all the periods is not the stable value that we assumed. When the computation cost is the dominant factor, a minor variation in communication cost does not matter. Hence, the performance is more stable. We will discuss the effect of the hash function again later on, especially with regard to the grayed out areas that appear in Fig. 6 and Fig. 7.

5.2.2 The Knapsack
The second algorithm analyzed is a knapsack algorithm presented in [23]. We chose this algorithm as it is an optimal PRAM algorithm and requires very fine-grain communication, with a pattern that depends on the input data. The knapsack algorithm follows:

Fig. 6. Matrix multiplication on 64 nodes, different ratios of computation to communication (data in gray area intentionally missing).

Fig. 7. Matrix multiplication on 128 nodes, different ratios of computation to communication (data in gray area intentionally missing).


Algorithm 5.2: The Knapsack Algorithm [23]
for i := [1 .. Input_Length] do
  processors 0 ≤ y < w[i]
    do F[i, y] := F[i - 1, y]; end
  processors w[i] ≤ y ≤ cutoff_value
    do F[i, y] := max(F[i - 1, y], F[i - 1, y - w[i]] + e[i]); end
end

In the above algorithm, array e holds the set of input elements. Array w holds the weight of each element in e, i.e., element e[i] has weight w[i]. The different possible knapsack combinations, made up of elements in e up to some cutoff_value, are stored in F.
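A sequential Python rendering of this first phase (our sketch; identifiers are illustrative) makes the data dependence explicit: row i of F depends only on row i - 1, which is what allows each capacity value y to be handled by a separate task.

# A sequential sketch of the recurrence in Algorithm 5.2.  In the parallel
# version each capacity value y is a task; here the inner loop plays the role
# of the "processors ... do ... end" blocks.
def knapsack_first_phase(w, e, cutoff_value):
    """w[i], e[i]: weight and value of element i (1-based lists, dummy index 0)."""
    n = len(w) - 1
    F = [[0] * (cutoff_value + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                     # one iteration of loop i
        for y in range(cutoff_value + 1):
            if y < w[i]:
                F[i][y] = F[i - 1][y]
            else:
                F[i][y] = max(F[i - 1][y], F[i - 1][y - w[i]] + e[i])
        # barrier: nothing written to row i is read before this point
    return F

# Example: three elements, capacity cutoff of 10.
F = knapsack_first_phase(w=[0, 3, 4, 6], e=[0, 5, 6, 9], cutoff_value=10)
print(F[3][10])   # best value achievable within weight 10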

The algorithm shown is not complete. It is the first phase that computes the different possible knapsack combinations that satisfy the cutoff_value. The second phase, not shown and not implemented, selects which elements will make up the knapsack. The second phase is sequential, with only one processor active at any time.

For this algorithm, a copy of the input data is distributed to all the computation tasks, but the two-dimensional array F is hashed to the global memory. Note here that one iteration of the knapsack corresponds to two supersteps. The first one evaluates the condition and reads the appropriate value of F, and the second computes the next value of F and writes it to the appropriate location.

Because the computation is too fine grained, however, we allowed execution to proceed without taking a barrier after the first superstep. We take a single barrier after the second superstep ends, i.e., a barrier is taken after every iteration of for loop i. Otherwise, the superstep operation is as described before for the matrix multiplication. Correct results are guaranteed in this case as no updates of the array F carried out in the current superstep are accessed until the barrier is passed.

In taking the measurements presented, we arranged for faults to occur at regular intervals, every 3,000 loops. We define a period as the duration between two faults, and calculate how long a period, i.e., 3,000 loops, takes to complete. The processor that fails is chosen at random. We evaluate the performance for nine faults and for two different system sizes.

The degradation for a 64- and a 512-node system as a percentage increase in computation time versus the number of faults is shown in Fig. 8.

From the discussion in Section 4.3, we know that, regardless of whether the computation or the communication cost is the dominant cost of the superstep, the degradation we observe will be p/(p - f), as, for this system, g does not degrade. Hence, the 64- and 512-node systems should show a maximum degradation of 16.7 percent and 11.1 percent, respectively, for each superstep.

The main thing to notice in Fig. 8 is that the degradation of the algorithm seems to be greater than the expected value. From (1), for the 64-node system, we expected at most 16.7 percent degradation after a single fault, while the actual degradation observed starts at 22 percent. For the 512-node system, the expected and actual values are 11.1 percent and 17 percent, respectively.

TABLE 1
MATRIX MULTIPLICATION ON 64 NODES, NO EXTRA OPERATIONS
(percentage of time spent)

Period   CPU active   Synch Idle   Serve MEM req   Idle or Context Switch
  0         16.1          4.2          16.3                63.5
  1         16.1         12.0          15.2                56.6
  2         15.9         10.5          15.8                57.7
  3         16.1          9.8          15.9                58.2
  4         16.2          9.1          15.9                58.8
  5         16.4          8.3          15.9                59.4
  6         16.6          7.3          16.3                59.8
  7         16.8          5.6          16.1                61.4
  8         17.1          5.6          16.2                61.0
  9         17.1          3.9          16.8                62.2

TABLE 2
MATRIX MULTIPLICATION ON 64 NODES, 50 TIMES THE NORMAL NUMBER OF OPERATIONS
(percentage of time spent)

Period   CPU active   Synch Idle   Serve MEM req   Idle or Context Switch
  0         78.7          1.1           4.3                15.8
  1         69.9          8.7           3.7                17.7
  2         71.0          6.2           3.8                19.1
  3         72.0          5.3           3.8                19.0
  4         73.0          4.6           3.8                18.6
  5         74.0          4.0           3.8                18.2
  6         74.9          3.5           3.9                17.7
  7         76.4          2.4           3.9                17.2
  8         77.5          2.0           4.1                16.5
  9         78.7          1.0           4.3                15.9


The reason is that we joined the two supersteps together, as described earlier. We are therefore seeing the combined degradation of two supersteps. As we have two supersteps, the overall degradation of one iteration of the knapsack algorithm will be bounded by 33.4 percent in the case of the 64-node system, and 22.2 percent in the case of the 512-node system. Hence, the degradation we observed for this algorithm does follow the behavior described in (1).

5.3 Points Where the Predictions for the Degradation Break Down

This section discusses the influence of the hash function, or of a bad memory distribution, on the degradation. In Fig. 6 and Fig. 7, certain points were grayed out. In Fig. 9, we present the complete graph for Fig. 7. The complete graph for Fig. 6 is similar and we omit it. The reason we did not just present the complete graphs at the beginning is the scaling effect that the huge degradation has on the rest of the graph. It makes it impossible to see clearly the behavior of the degradation at other parts of the graph.

When discussing the behavior of the expected degradation in Section 4, we made the assumption that communication costs do not overshadow computation costs. We also assumed that the hash function used was indeed robust and behaved perfectly in all situations. Though classes of hash functions that theoretically guarantee even distributions of memory requests exist [11], we have not used them here because of their computational cost. Instead, we used a linear hash function, which is not theoretically adequate, but for which good simulation results have been reported [24].
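As an illustration of what such a linear hash might look like (the constants and the split into node and offset below are our assumptions, not the parameters actually used), a global address can be mapped to a home node and a local offset as follows; switching hash functions amounts to picking new multiplier and offset values.

# A hedged sketch of a linear hash for Randomized Shared Memory.  The constants
# are hypothetical; the intent is only to show the shape of the mapping from a
# global address to a (node, local offset) pair over the live processors.
M = 2**31 - 1                     # prime modulus bounding the global address space

def make_linear_hash(a, b, p):
    """Return a mapping for p live nodes; (a, b) select a member of the class."""
    def home(x):
        h = (a * x + b) % M
        return h % p, h // p      # home node, offset within that node
    return home

h = make_linear_hash(a=48271, b=12345, p=53)   # e.g., 53 live nodes after faults
print(h(1_000_000))                            # (node, offset) for one global address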

It is obvious that the huge degradation at faults 10 and 11 is due to the hash function not randomizing global memory requests well enough. Note that the pattern and the number of memory accesses for the different versions of the matrix multiplication are the same and, therefore, the communication cost of the superstep is the same in all versions of the algorithm. In the case of the algorithm with the highest computation load, the jump is smallest because of the relatively small difference between the original high cost of the superstep and the increased communication cost.

It is perhaps unreasonable to expect a single hash function to operate robustly over a wide range of processor numbers, p. If we determine that the behavior of a hash function for a certain number of processors is bad, we could switch to a different hash function at runtime, even though this operation is very expensive.

5.4 The Overhead
Here, we discuss the experimental results for the overhead of the matrix and knapsack algorithms.

5.4.1 The Matrix Multiplication Overhead
The matrix multiplication issues two read and one write operation per superstep. As pointed out in Section 4.4.1, the expected maximum overhead should be at most 33.3 percent.
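Under our reading of the scheme, in which every global write generates one extra write message (see the Conclusion), this bound can be estimated directly; the helper below is illustrative only.

# A back-of-the-envelope estimate (our derivation, not the paper's equations):
# one extra message per global write means the added traffic is at most the
# write fraction of the superstep's messages.
def max_overhead(reads, writes):
    return writes / (reads + writes)

print(max_overhead(reads=2, writes=1))   # 0.333...: two reads + one write (matrix)
print(max_overhead(reads=0, writes=1))   # 1.0: a write-only superstep, as in the knapsack's write step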

Fig. 10 plots the behavior of the visible overhead against the computational cost of a superstep for the matrix multiplication. When the computation cost is low, there is more visible overhead than when the computation cost is high. This behavior follows from the previous discussion. We conjecture that if the computation cost becomes zero, then the full value of the overhead, i.e., 33.3 percent, will become visible.

5.4.2 The Knapsack Overhead
The overhead statistics for the 64- and 512-node systems averaged over the entire runtime of a fault-free execution are shown in Table 3. Note that the visible overhead per period is uniform [21].

Fig. 8. Percentage of degradation of the knapsack algorithm.

Fig. 9. Matrix multiplication on 128 nodes, different ratios of computation to communication.

Fig. 10. Overhead versus CPU utilization for 64 and 128 nodes (matrix multiplication).


The overhead observed is very similar in both cases. The prediction in Section 4.4.1 was that the overhead for the knapsack algorithm should be a maximum of 100 percent. The overhead observed is a much lower value: 15.7 percent and 13.9 percent for a 64- and 512-node system, respectively.

One reason why the observable overhead is so low, even though the algorithm is communication bound, is the very high percentage of idle time due to communication latency. Part of the overhead may be hidden there. It is also likely that the distribution of messages was not good to start off with and, by adding the overhead messages, a relatively smaller imbalance is caused. Another reason may be the execution strategy we chose for this algorithm, i.e., the merging of the two supersteps.

5.5 Efficiency of the Emulation
The efficiency of Randomized Shared Memory and of the BSP model emulation has been investigated before [24]. Here, we discuss briefly the efficiency of the two test algorithms.

Because the performance analyzer on the AP1000 cannot handle a large number of tasks, we cannot accurately determine the idle time or the time spent context switching. We therefore group these together. In both cases, we expect a high context switching time, as we could not employ lightweight processes.

The matrix multiplication performance breakdown presented in detail in Tables 1 and 2 is quite encouraging. The computation grain of the matrix algorithm is much higher than that of the knapsack and fits the machine better.

In the case of the knapsack algorithm (see Table 4 for a summary of the performance breakdown), the higher percentage of context switch/idle time is to be expected as the computation steps between the communication requests are very few. Even though the AP1000 has an efficient communication network, its performance is still not at the levels required by the BSP model for ensuring efficient PRAM emulation; hence, not all the communication latency was masked.

More surprising, even though a similar result was obtained in [24], is the average proportion of time spent synchronizing, column Synch Idle in Table 4. The proportion of time spent by a node at the barrier increases, nearly doubling, between the 64- and 512-node systems. This increase happens even though the AP1000 provides hardware facilities for fast barrier synchronization.

It seems that this increase in synchronization time is influenced not so much by the number of nodes in the system but by load imbalance. The tasks of the knapsack vary in the amount of work they do and, as the system scales up, this imbalance makes it more and more difficult for the nodes to reach the barrier at more or less the same time. On the other hand, the tasks of the matrix multiplication do exactly the same work and the synchronization time is quite low. The effect of load imbalance on the synchronization time in the case of the matrix algorithm is also visible for periods 2 to 8 in Tables 1 and 2.

6 CONCLUSION

We implemented the BSP model through Randomized Shared Memory on an MPP and defined a fault-tolerant scheme using properties of RSM. The scheme uses memory duplication. The secondary mapping is kept up to date by generating an extra write operation for every normal global memory write. Hence, the overhead of our approach depends on the algorithm being executed and is likely to be small, especially if we can manipulate the algorithm grain to fit the machine. The experimental results agree with the predictions made about the overhead in Section 4.4.1.
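As a minimal sketch of that duplication path (the hash functions and the send primitive below are placeholders, not the actual implementation), every global write is simply issued twice, once to the primary home node and once to a distinct secondary node.

# Illustrative only: every global write goes to both the primary and the
# secondary mapping, so a single node failure cannot lose the value.
def global_write(addr, value, primary_hash, secondary_hash, send):
    p_node, p_off = primary_hash(addr)
    s_node, s_off = secondary_hash(addr)
    send(p_node, p_off, value)       # the normal write
    send(s_node, s_off, value)       # the extra write that keeps the copy current

# Toy demonstration with 8 nodes and a dictionary standing in for node memories.
memory = {}
send = lambda node, off, val: memory.__setitem__((node, off), val)
primary = lambda x: (x % 8, x // 8)
secondary = lambda x: ((x % 8 + 4) % 8, x // 8)   # always lands on a different node here
global_write(42, 3.14, primary, secondary, send)
print(memory)   # {(2, 5): 3.14, (6, 5): 3.14}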

The system can recover from a single fault by carrying out global reconfiguration. System reconfiguration depends on the number of nodes and the memory size. Reconfiguration can be accomplished in a relatively short time, on the order of seconds, even though global reconfiguration is done. We did not consider the effect of network degradation on the reconfiguration, however.

Once the system is reconfigured, it can withstand another fault. As long as each node has enough memory to hold both the primary and secondary mapping allocated to it, a new system can be configured. The system was also shown to be gracefully degrading and to follow closely the predictions made in Section 4.3, especially for algorithms with a good fit to the machine's BSP model parameters.

It appears that Randomized Shared Memory is capable of giving adequate performance under the condition that we can fit the algorithm to the machine characteristics. The most worrying issue is that load imbalance in the algorithm is very likely to lead to long barrier times and, therefore, potentially decreased performance. It may be necessary to explore other synchronization methods.

TABLE 3
KNAPSACK AVERAGE OVERHEAD FOR 64 AND 512 PROCESSORS

Processor   Average time (sec) to complete a period   Overhead
Number      No Overhead          With Overhead         (%)
 64            32.71                 37.85             15.71
512            52.35                 59.61             13.88

TABLE 4
PERFORMANCE RESULTS FOR KNAPSACK ALGORITHM ON 64 AND 512 PROCESSORS
(average percentage of time spent)

Number of Nodes   CPU active   Synch Idle   Serve MEM req   Idle or Context Switch
      64              5.5          12.8          27.8                54.0
     512              4.7          22.9          14.8                57.6


The effect of network degradation on both the reconfiguration time and the performance of the resulting system needs to be taken into account. Evaluation of the BSP model parameters for different machines needs to be carried out. Also, the way that faults in the communication network affect the parameter g of each machine has to be taken into account.

ACKNOWLEDGMENTS

We thank Fujitsu Parallel Computing Research Facilities in Japan for their generous access to the AP1000 MPP. We also thank the anonymous referees for their help in improving this paper. This work was carried out while Andreas Savva was with the Tokyo Institute of Technology on a Japanese Ministry of Education Scholarship.

REFERENCES

[1] The U.S. President's Office of Science and Technology Policy, "Grand Challenges 1993: High Performance Computing and Communications," 1993.
[2] M. Peercy and P. Banerjee, "Design and Analysis of Software Reconfiguration Strategies for Hypercube Multicomputers Under Multiple Faults," Proc. 22nd Int'l Symp. Fault Tolerant Computing, pp. 448-455, June 1992.
[3] R. Jagannathan and E.A. Ashcroft, "Fault Tolerance in Parallel Implementations of Functional Languages," Proc. 21st Int'l Symp. Fault Tolerant Computing, pp. 256-263, June 1991.
[4] K.H. Kim, "Programmer-Transparent Coordination of Recovering Concurrent Processes: Philosophy and Rules for Efficient Implementation," IEEE Trans. Software Eng., vol. 14, no. 6, pp. 810-821, June 1988.
[5] B. Vinnakota and N.K. Jha, "Synthesis of Algorithm-Based Fault-Tolerant Systems from Dependence Graphs," IEEE Trans. Parallel and Distributed Systems, vol. 4, no. 8, pp. 864-874, Aug. 1993.
[6] M. Schneider, "Self-Stabilization," ACM Computing Surveys, vol. 25, pp. 45-67, Mar. 1993.
[7] D. Culler, R. Karp, D. Patterson, A. Sahay, K.E. Schauser, E. Santos, R. Subramonian, and T. von Eicken, "LogP: Towards a Realistic Model of Parallel Computation," Proc. Fourth ACM SIGPLAN Symp. Principles and Practices of Parallel Programming, pp. 1-12, May 1993.
[8] L. Valiant, "A Bridging Model for Parallel Computation," Comm. ACM, vol. 33, pp. 103-111, Aug. 1990.
[9] A.V. Gerbessiotis and L.G. Valiant, "Direct Bulk-Synchronous Parallel Algorithms," Proc. Third Scandinavian Workshop Algorithm Theory, O. Nurmi and E. Ukkonen, eds., pp. 1-18, 8-10 July 1992.
[10] R.H. Bisseling and W.F. McColl, "Scientific Computing on Bulk Synchronous Parallel Architectures (Short Version)," Proc. 13th IFIP World Computer Congress, B. Pehrson and I. Simon, eds., vol. I, pp. 509-514, 1994.
[11] K. Mehlhorn and U. Vishkin, "Randomized and Deterministic Simulation of PRAMs by Parallel Machines with Restricted Granularity of Parallel Memories," Acta Informatica, vol. 21, pp. 339-374, 1984.
[12] C. Engelmann and J. Keller, "Simulation-Based Comparison of Hash Functions for Emulated Shared Memory," Proc. PARLE '93: Parallel Architectures and Languages Europe, A. Bode, M. Reeve, and G. Wolf, eds., pp. 1-11, June 1993.
[13] R.D. Schlichting and F.B. Schneider, "Fail-Stop Processors: An Approach to Designing Fault-Tolerant Systems," ACM Trans. Computing Systems, vol. 1, pp. 222-238, Aug. 1983.
[14] T. Nanya, "Design Approach to Self-Checking VLSI Processors," Design Methodologies, S. Goto, ed., chapter 8, pp. 235-267, North Holland, 1985.
[15] H. Ishihata, T. Horie, S. Inano, T. Shimizu, S. Kato, and M. Ikesaka, "Third Generation Message Passing Computer AP1000," Proc. Int'l Symp. Supercomputing, 1991.
[16] C.J. Glass and L.M. Ni, "Fault-Tolerant Wormhole Routing in Meshes," Proc. 23rd Int'l Symp. Fault Tolerant Computing, pp. 240-249, June 1993.
[17] K. Bolding and W. Yost, "Design of a Router for Fault-Tolerant Networks," Proc. 1994 Parallel Computer Routing and Comm. Workshop, pp. 226-240, May 1994.
[18] J.S. Plank and K. Li, "Faster Checkpointing with N + 1 Parity," Proc. 24th Int'l Symp. Fault Tolerant Computing, pp. 288-297, June 1994.
[19] A. Savva and T. Nanya, "Using the Bulk-Synchronous Parallel Model with Randomised Shared Memory for Graceful Degradation," Technical Report FTS93-23, IEICE, Aug. 1993. Also in Proc. Second Parallel Computing Workshop (PCW '93) of Fujitsu Parallel Computing Research Facilities (FPCRF).
[20] Z.M. Kedem and K.V. Palem, "Transformations for the Automatic Derivation of Resilient Parallel Programs," Proc. 1992 IEEE Workshop Fault-Tolerant Parallel and Distributed Systems, pp. 16-25, July 1992.
[21] A. Savva and T. Nanya, "Gracefully Degrading Systems Using the Bulk-Synchronous Parallel Model with Randomised Shared Memory," Proc. 25th Int'l Symp. Fault Tolerant Computing, pp. 299-308, June 1995.
[22] R.P. Brent, "The Parallel Evaluation of General Arithmetic Expressions," J. ACM, vol. 21, pp. 201-206, Apr. 1974.
[23] J. Lin and A. Storer, "A New Parallel Algorithm for the Knapsack Problem and Its Implementation on a Hypercube," Proc. Third Symp. Frontiers of Massively Parallel Computation, J. JáJá, ed., pp. 2-7, Oct. 1990.
[24] H. Hellwagner, "Randomized Shared Memory—Concept and Efficiency of a Scalable Shared Memory Scheme," Parallel Architectures, pp. 102-117, Springer-Verlag, 1993.

Andreas Savva received the BSc (Eng) and MSc in computing from the Imperial College of Science, Technology, and Medicine, University of London, UK, and the DrEng degree from the Tokyo Institute of Technology, Tokyo, Japan. He is currently working at Fujitsu Ltd., Japan. His research interests include fault tolerance, massively parallel processing, compilation techniques for parallel processors, and performance evaluation. He is a member of the ACM, the IEEE, and the IEICE.

Takashi Nanya received the BE and ME degrees in mathematical engineering and information physics from the University of Tokyo, Tokyo, Japan, in 1969 and 1971, respectively, and his DrEng degree in electrical engineering from the Tokyo Institute of Technology, Tokyo, Japan, in 1978. He worked on digital system design methodology at NEC Central Research Laboratories from 1971 to 1981. In 1981, he moved to the Tokyo Institute of Technology, where he was a professor of computer science. In 1995, he joined the University of Tokyo, where he is a professor at the Research Center for Advanced Science and Technology.

Dr. Nanya was a visiting research fellow at Oakland University, Michigan, in the fall quarter of 1982, and at Stanford University, California, in the 1986-1987 academic year. His research interests include fault-tolerant computing, computer architecture, design automation, and asynchronous computing. He received the IEICE Best Paper Award in 1987, the Okawa Prize for Publication in 1994, and the ASP-DAC Best Paper Award in 1998. He served as program cochair of the 1994 IEEE International Symposium on Fault-Tolerant Computing, as conference cochair of the 1996 IEEE International Symposium on Advanced Research in Asynchronous Circuits and Systems, and as guest editor of a special issue on asynchronous architecture in IEE Proceedings - Computers and Digital Techniques. He is a senior member of the IEEE.