Fault Tolerant MPI

Anthony Skjellum*$, Yoginder Dandass$, Pirabhu Raman*
* MPI Software Technology, Inc.
$ Mississippi State University

FALSE2002 Workshop, November 14, 2002

Outline

- Motivations
- Strategy
- Audience
- Technical Approaches
- Summary and Conclusions

Motivations for MPI/FT

- Well-written and well-tested legacy MPI applications will abort, hang, or die more often in harsh or long-running environments because of extraneously introduced errors
- Parallel computations are fragile at present
- There is apparent demand for recovery of running parallel applications
- Learn how "fault tolerant" we can make MPI programs and implementations without abandoning this programming model

Strategy

- Build on MPI/Pro, a commercial MPI
- Support extant MPI applications
- Define application requirements/subdomains
- Do a very good job for the Master/Slave model first
- Offer higher-availability services
- Harden transports
- Work with OEMs to offer more recoverable services
- Use specialized parallel computational models to enhance effective coverage
- Exploit third-party checkpoint/restart, nMR, etc.
- Exploit Gossip for detection

Audience

- High-scale, higher-reliability users
- Low-scale, extremely-high-reliability users (nMR involved for some nodes)
- Users of clusters for production runs
- Users of embedded multicomputers
- Space-based and highly embedded settings
- Grid-based MPI applications

Detection and Recovery from Extraneously Induced Errors

[Diagram: errors/failures at each layer (application; MPI; network, drivers, NIC; OS, runtime, monitors) are detected by layer-appropriate mechanisms (ABFT/aBFT; MPI sanity; network sanity; watchdog/BIT/other). Detection feeds a recovery process, informed by application-execution-model specifics, after which the application recovers.]

Coarse-Grain Detection and Recovery
(adequate if no SEU)

[Flowchart: "mpirun -np NP my.app" has three outcomes. No error: my.app finishes, mpirun finishes (success). MPI-library error: my.app aborts, mpirun finishes (failure). Process dies: my.app hangs, mpirun hangs (failure). Detection/recovery loop: if the run aborted, send it to ground; otherwise, if the job is hung, abort my.app, else continue waiting.]

Example: NIC Errors

[Diagram: rank r0's send buffer travels from user level through the MPI library to the NIC at device level, across the network to rank r1's NIC, and back up through MPI to r1's receive buffer.]

- The NIC has the 2nd-highest SEU strike rate, after the main CPU
- Legacy MPI applications will be run in simplex mode

"Obligations" of a Fault-Tolerant MPI

- Ensure reliability of data transfer at the MPI level
- Build reliable header fields
- Detect process failures
- Transient error detection and handling
- Checkpointing support
- Two-way negotiation with scheduler and checkpointing components
- Sanity checking of MPI applications and underlying resources (non-local)

Low-Level Detection Strategy for Errors and Dead Processes

[Flowchart: initiate device-level communication and check for low-level success. On success, return MPI_SUCCESS. On a timeout error, ask the SCT whether the peer is alive: if yes, reset the timeout and retry; if no, trigger an event. On any other error, ask the EH whether it is recoverable: if yes, reset the timeout and retry; if no, trigger an event.]

EH: Error Handler; SCT: Self-Checking Thread

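The flowchart's retry logic can be sketched as follows; the names `try_send`, `is_peer_alive` (the SCT query), and `eh_recoverable` (the EH query) are illustrative stand-ins, not MPI/FT's actual interfaces.

```python
# Illustrative sketch of the low-level detection loop above; the callback
# names are hypothetical, not part of the real MPI/FT API.
MPI_SUCCESS, EVENT_TRIGGERED = "MPI_SUCCESS", "EVENT_TRIGGERED"

def low_level_op(try_send, is_peer_alive, eh_recoverable, max_retries=3):
    """Run a device-level operation, consulting the self-checking thread on
    timeouts and the error handler on other errors before retrying."""
    for _ in range(max_retries):
        ok, error = try_send()              # (success?, error kind or None)
        if ok:
            return MPI_SUCCESS              # low-level success
        if error == "timeout":
            if not is_peer_alive():         # SCT says the peer is dead
                return EVENT_TRIGGERED
        elif not eh_recoverable(error):     # EH says unrecoverable
            return EVENT_TRIGGERED
        # otherwise: reset the timeout and retry
    return EVENT_TRIGGERED

# A send that times out twice and then succeeds is retried to completion.
attempts = iter([(False, "timeout"), (False, "timeout"), (True, None)])
result = low_level_op(lambda: next(attempts), lambda: True, lambda e: False)
```

The key design point from the slide is that a timeout alone does not condemn the operation: only a negative liveness verdict from the SCT (or an unrecoverable verdict from the EH) triggers an event.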
Overview of MPI/FT Models

[Table of MPI/FT models by application style, MPI support, nMR placement, and checkpoint/recovery (application- and system-level). SPMD with MPI-1.2: MFT-I (no ranks nMR) and MFT-II (several ranks nMR), both with application and system checkpoint/recovery; unnamed no-MPI SPMD variants also support checkpoint/recovery. Master/Slave with MPI-1.2: MFT-IIIs (rank 0 nMR) and MFT-IIIm (several ranks nMR); Master/Slave with MPI-2 DPM: MFT-IVs (rank 0 nMR) and MFT-IVm (several ranks nMR), each with application and system checkpoint/recovery; unnamed no-MPI Master/Slave variants also support checkpoint/recovery.]

Design Choices

[Diagrams: message replication from an nMR rank 0 to a simplex (non-nMR) rank 1, and from an nMR rank 0 to an nMR rank 1.]

Two replication strategies:
- Replicated ranks send/receive messages independently of each other
- One copy of the replicated rank acts as the message conduit

Parallel nMR Advantages

[Diagram: three replicas A, B, and C of a four-rank application (ranks 0-3); n = 3, np = 4.]

- Voting on messages only (not on each word of state)
- Local errors remain local
- Requires two failures to fail (e.g., A0 and C0)

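Voting on messages, the first advantage above, can be illustrated with a small sketch (not MPI/FT code): the receiver compares the n copies of a message and takes the majority value, so a single corrupted copy stays local.

```python
from collections import Counter

def vote(copies):
    """Majority-vote over the message copies received from the n replicas.
    A single corrupted copy is outvoted; its error stays local."""
    value, count = Counter(copies).most_common(1)[0]
    if 2 * count <= len(copies):
        raise RuntimeError("no majority: multiple replicas failed")
    return value

# n = 3: one corrupted copy (say, from replica A0) is masked.
good = vote([b"result", b"garbled", b"result"])

# Two failures (e.g., A0 and C0 both corrupting) defeat triple redundancy:
try:
    vote([b"result", b"garbled-1", b"garbled-2"])
    failed = False
except RuntimeError:
    failed = True
```

Note that only the messages are voted on, which is far cheaper than comparing every word of replica state on every step.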
MFT-IIIs: Master/Slave with MPI-1.2

- Master (rank 0) is nMR
- MPI_COMM_WORLD is maintained in nMR
- MPI-1.2 only
  - Application cannot actively manage processes
  - Only middleware can restart processes
- Pros:
  - Supports send/receive MPI-1.2
  - Minimizes excess messaging
  - Largely transparent to the MPI application
  - Quick recovery possible
  - ABFT-based process recovery assumed
- Cons:
  - Scales to O(10) ranks only
  - Voting still limited
  - Application explicitly fault-aware

[Diagram: replicated rank 0 (master) with slaves ranks 1 through n, showing the logical and actual flow of MPI messages, e.g., a message from 1 to 0 and a message from 0 to 2.]

MFT-IVs: Master/Slave with MPI-2

- Master (rank 0) is nMR
- MPI_COMM_WORLD is maintained in nMR
- MPI_COMM_SPAWN(): the application can actively restart processes
- Pros:
  - Supports send/receive MPI-1.2 + DPM
  - Minimizes excess messaging
  - Largely transparent to the MPI application
  - Quick recovery possible, simpler than MFT-IIIs
  - ABFT-based process recovery assumed
- Cons:
  - Scales to O(10) ranks only
  - Voting still limited
  - Application explicitly fault-aware

[Diagram: as in MFT-IIIs, a replicated rank 0 (master) with slaves ranks 1 through n, showing logical and actual MPI message flow.]

Checkpointing the Master for Recovery

- Master checkpoints; slaves vote on master liveness
- When master failure is detected, the lowest-rank slave restarts the master from the checkpointed data
  - Any of the slaves could promote itself and assume the role of master
  - Peer-liveness knowledge is required to decide the lowest rank
- Pros:
  - Recovery independent of the number of faults
  - No additional resources
- Cons:
  - Checkpointing further reduces scalability
  - Recovery time depends on the checkpointing frequency

[Diagram: rank 0 (master) exchanges MPI messages with slaves ranks 1 through n and writes checkpointing data to a storage medium.]

Checkpointing Slaves for Recovery (Speculative)

- Slaves checkpoint periodically at a low frequency
- Probability of failure of a slave > probability of failure of the master
- When master failure is detected, the master is recovered from data checkpointed at the various slaves
- Peer-liveness knowledge is required to decide the lowest rank
- Pros:
  - Checkpointing overhead of the master is eliminated
  - Aids faster recovery of slaves
- Cons:
  - Increase in master recovery time
  - Increase in overhead due to checkpointing at all the slaves

[Diagram: slaves ranks 1 through n exchange MPI messages with rank 0 and each write checkpointing data to a storage medium (SM).]

Caveats:
- Slaves are stateless, so checkpointing slaves does not help in restarting them
- Checkpointing at all the slaves could be really expensive
- Instead of checkpointing, slaves could return the results to the master

Adaptive Checkpointing and nMR of the Master for Recovery

- Start with 'n' replicates
- Initial checkpointing calls generate no-ops
- Slaves track the liveness of the master and the replicates
- Failure of the last replicate initiates checkpointing
- Pros:
  - Tolerates 'n' faults with negligible recovery time
  - Subsequent faults can still be recovered
- Cons:
  - Increase in the overhead of tracking the replicates

[Diagram: replicated rank 0 with slaves ranks 1 through n; logical and actual flow of MPI messages; checkpointing data written to a storage medium.]

Self-Checking Threads
(scales > O(10) nodes can be considered)

Invoked by the MPI library:
- Checks whether peers are alive
- Checks for network sanity
- Serves queries from the coordinator
- Exploits timeouts
- Periodic execution, no polling
- Provides a heartbeat across the application
- Can check internal MPI state

Queried by the coordinator:
- Votes on communicator state
- Checks buffers
- Checks queues for aging
- Checks local program state
- Invoked periodically
- Invoked when suspicion arises

MPI/FT SCT Support Levels

- Level I: simple, non-portable; uses internal state of MPI and/or the system
- Level II: simple, portable; exploits threads and PMPI_ or the PERUSE API
- Level III: complex state checks, portable; exploits queue interfaces
- All of the above

MPI/FT Coordinator

- Spawned by mpirun or similar
- Closely coupled with ("friend" of) the application's MPI library
- User transparent
- Periodically collects status information from the application
- Can kill and restart the application or individual ranks
- Preferably implemented using MPI-2
- We'd like to replicate and distribute this functionality

Use of Gossip in MPI/FT

- Applications in Models III and IV assume a star topology
- Gossip requires a virtual all-to-all topology
- The data network may be distinct from the control network
- Gossip provides:
  - A potentially scalable, fully distributed scheme for failure detection and notification with reduced overhead
  - Notification of failures in the form of a broadcast

Gossip-based Failure Detection

[Diagram: three nodes exchange heartbeat tables by gossip. Each node keeps a heartbeat entry per peer; when a gossip message arrives, an entry is updated only if the incoming heartbeat is fresher (e.g., incoming heartbeat 3 > local 0: update; incoming heartbeat 0 < local 2: no update). A node whose entry stays stale for Tcleanup = 5 * Tgossip cycles is suspected dead and flagged in the suspect vector S; in the example, node 3's entry reaches 5 elapsed cycles and node 3 is declared suspect.]

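One common formulation of this heartbeat bookkeeping can be sketched as follows. This is a sketch under the assumption of monotonically increasing heartbeat counters (larger = fresher), not the exact Gossip implementation used by MPI/FT.

```python
T_CLEANUP = 5  # suspect a node after 5 gossip cycles without fresher news

def merge_heartbeats(local, incoming, cycle):
    """Merge an incoming heartbeat table into the local one, keeping the
    fresher (larger) heartbeat per peer and remembering the cycle in which
    each entry last advanced."""
    for node, hb in incoming.items():
        if hb > local[node][0]:     # fresher information: update
            local[node] = (hb, cycle)
        # else: stale information, no update

def suspects(local, cycle):
    """Peers whose heartbeat has not advanced for T_CLEANUP cycles."""
    return {n for n, (hb, last) in local.items() if cycle - last > T_CLEANUP}

# Node 1's table: {peer: (heartbeat, cycle last updated)}.
table = {2: (0, 0), 3: (2, 0)}
merge_heartbeats(table, {2: 3, 3: 1}, cycle=2)  # 3 > 0: update; 1 < 2: no update
```

After the merge, peer 2's entry is fresh, while peer 3's entry still dates from cycle 0; once enough cycles elapse with no fresher news, peer 3 lands in the suspect set.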
Consensus about Failure

[Diagram: each node maintains a suspect matrix whose row i is node i's suspect vector. Nodes 1 and 2 each suspect node 3. When node 2's suspect matrix is merged at node 1, the rows for both live nodes show node 3 suspected, so node 3 is declared dead and the live list L becomes (1, 1, 0) at nodes 1 and 2.]

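The merge-and-agree step can be sketched as follows; this is an illustrative reconstruction of the slide's example, not the Gossip implementation itself.

```python
def merge_suspect(a, b):
    """Merge two suspect matrices by element-wise OR.
    Row i is node i's suspect vector (columns = suspected nodes)."""
    return [[x | y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def update_live(matrix, live):
    """Declare node j dead once every *other* live node suspects it."""
    new = live[:]
    for j in range(len(live)):
        others = [i for i in range(len(live)) if live[i] and i != j]
        if others and all(matrix[i][j] for i in others):
            new[j] = 0
    return new

# Slide example (0-indexed): nodes 1 and 2 each suspect node 3, whose own
# row is necessarily silent about itself.
at_node1 = [[0, 0, 1], [0, 0, 0], [0, 0, 0]]
at_node2 = [[0, 0, 0], [0, 0, 1], [0, 0, 0]]
merged = merge_suspect(at_node1, at_node2)
live = update_live(merged, [1, 1, 1])
```

Requiring agreement among all other live nodes is what turns a local suspicion into a consensus declaration of death.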
Issues with Gossip - I

After node a fails:
- If node b, the node that arrives at consensus on node a's failure last (the notification broadcaster), also fails before broadcasting:
  - Gossiping continues until another node, c, suspects that node b has failed
  - Node c broadcasts the failure notification of node a
  - Eventually node b is also determined to have failed

Issues with Gossip - II

If the control and data networks are separate:
- MPI progress threads monitor the status of the data network
- Failure of the link to the master is indicated when communication operations time out
- Gossip monitors the status of the control network
- The progress threads communicate the suspected status of the master node to the gossip thread
- Gossip incorporates this information into its own failure-detection mechanism

Issues with Recovery

If network failure causes a partitioning of processes:
- Two or more isolated groups may form that communicate within themselves
- Each group assumes that the other processes have failed and attempts recovery
- Only the group that can reach the checkpoint data is allowed to initiate recovery and proceed
  - Recovering when multiple groups can access the checkpoint data is under investigation
- If only nMR is used, the group with the master is allowed to proceed
  - Recovering when the replicated master processes are split between groups is under investigation

Shifted APIs

- Try to "morally conserve" the MPI standard
- Timeout parameter added to messaging calls to control the behavior of individual MPI calls
  - Modify existing MPI calls, or
  - Add new calls with the added functionality
- Add a callback function to MPI calls (for error handling)
  - Modify existing MPI calls, or
  - Add new calls with the added functionality
- Support in-band or out-of-band error management made explicit to the application
- Runs in concert with MPI_ERRORS_RETURN
- Offers the opportunity to give hints as well, where meaningful

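A shifted send with a per-call timeout and an error callback might look like the following sketch; `ft_send` and its signature are hypothetical illustrations of the idea, not the actual MPI/FT API.

```python
import concurrent.futures

MPI_SUCCESS, MPI_ERR_TIMEOUT = 0, 1

def ft_send(send_fn, buf, dest, timeout, on_error=None):
    """Run a blocking send with a per-call timeout; on expiry, report the
    error to the application's callback (in the spirit of
    MPI_ERRORS_RETURN) instead of aborting the run."""
    ex = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        fut = ex.submit(send_fn, buf, dest)
        fut.result(timeout=timeout)
        return MPI_SUCCESS
    except concurrent.futures.TimeoutError:
        if on_error is not None:
            on_error(MPI_ERR_TIMEOUT, dest)   # error handling callback
        return MPI_ERR_TIMEOUT
    finally:
        ex.shutdown(wait=False)               # don't block on a hung send

# A send that completes promptly succeeds; a hung send times out and the
# application's callback is invoked with the error and destination.
import time
errors = []
ok = ft_send(lambda b, d: None, b"data", 1, timeout=1.0)
bad = ft_send(lambda b, d: time.sleep(1.0), b"data", 2, timeout=0.05,
              on_error=lambda err, dest: errors.append((err, dest)))
```

The point of the "shifted" signature is that the application, not the library, decides how long an individual call may block and what happens when it fails.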
Application-based Checkpoint

- A point of synchronization for a cohort of processes
- Minimal fault tolerance could be applied only at such checkpoints
- Defines the "save state" or "restart data" needed to resume
- Common practice in parallel CFD and other MPI codes, because of the reality of failures
- Essentially gets no special help from the system
- Look to Parallel I/O (MPI-2) for improvement
- Why? Minimum complexity of I/O + feasible

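An application-level checkpoint of this kind reduces to "write the restart data at a synchronization point; look for it on startup." A minimal single-process sketch follows; the file name and state layout are invented for illustration.

```python
import json, os

def run(steps, ckpt="restart.json"):
    """Toy compute loop: resume from application-defined restart data if a
    checkpoint exists, and save state at periodic synchronization points."""
    if os.path.exists(ckpt):
        with open(ckpt) as f:
            state = json.load(f)            # the "restart data" needed to resume
    else:
        state = {"step": 0, "total": 0}
    while state["step"] < steps:
        state["total"] += state["step"]     # stand-in for real computation
        state["step"] += 1
        if state["step"] % 10 == 0:         # synchronization point: save state
            with open(ckpt + ".tmp", "w") as f:
                json.dump(state, f)
            os.replace(ckpt + ".tmp", ckpt) # atomic: never a torn checkpoint
    return state["total"]
```

The write-then-rename pattern means a crash mid-checkpoint leaves the previous checkpoint intact, so a restart always resumes from complete data.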
In Situ Checkpoint Options

- Checkpoint to bulk memory
- Checkpoint to flash
- Checkpoint to other distributed RAM
- Other choices?
- Are these useful? Depends on the error model

Early Results with Hardening Transport: CRC vs. Time-Based nMR

[Chart: total time (sec) vs. message size (32 B to 512 KB), comparing the no-CRC baseline, CRC, and 3MR using MPI/Pro (version 1.6.1-1tv).]

[Chart: MPI/Pro time ratios normalized against baseline performance, crc/nocrc and 3mr/nocrc, vs. message size.]

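The CRC approach attaches a checksum to each message and verifies it at the receiver. A minimal sketch using CRC-32 from Python's zlib follows; the deck does not say which CRC polynomial the MPI/FT transport actually used.

```python
import struct, zlib

def frame(payload):
    """Prefix the payload with its CRC-32, as a hardened transport might."""
    return struct.pack("!I", zlib.crc32(payload)) + payload

def unframe(msg):
    """Verify and strip the CRC; a mismatch signals corruption in transit."""
    crc = struct.unpack("!I", msg[:4])[0]
    payload = msg[4:]
    if zlib.crc32(payload) != crc:
        raise ValueError("CRC mismatch: message corrupted in transit")
    return payload

msg = frame(b"work unit 7")

# A single flipped payload bit is detected at the receiver:
corrupted = msg[:4] + bytes([msg[4] ^ 0x01]) + msg[5:]
try:
    unframe(corrupted)
    detected = False
except ValueError:
    detected = True
```

Unlike time-based nMR, which sends each message n times, the CRC costs one checksum computation per side plus 4 bytes per message, which is why the crossover between the two depends on message size.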
Early Results, II

[Chart: total time (sec) vs. message size, comparing no CRC, CRC, and 3MR with baseline using MPICH (version 1.2.1).]

[Chart: time ratios of MPICH against baseline MPI/Pro timings, nocrc_mpich/nocrc_mpipro, crc_mpich/nocrc_mpipro, and 3mr_mpich/nocrc_mpipro, vs. message size.]

Early Results, III: Time-Based nMR with MPI/Pro

[Chart: total time for 10,000 runs (sec) vs. message size for various nMR levels, 3MR through 9MR.]

[Chart: MPI/Pro time ratios of various nMR levels to the baseline, 3mr/nocrc through 9mr/nocrc, vs. message size.]

Other Possible Models

- Master/Slave was considered before
- Broadcast/Reduce data-parallel applications
- Independent processing + corner turns
- Ring computing
- Pipeline bi-partite computing
- General MPI-1 models (all-to-all)
- Idea: trade generality for coverage

What about Receiver-Based Models?

- Should we offer, instead of or in addition to MPI/Pro, a receiver-based model?
- Utilize publish/subscribe semantics
- Bulletin boards? Tagged messages, etc.
- Try to get rid of the single point of failure this way
- Sounds good; can it be done?
- Will anything like an MPI code work?
- Does anyone code this way? (e.g., JavaSpaces, Linda, military embedded distributed computing)

Plans for Upcoming 12 Months

- Continue implementation of MPI/FT to support applications in simplex mode
- Remove single points of failure for Master/Slave
- Support for multi-SPMD models
- Explore additional application-relevant models
- Performance studies
- Integrate fully with the Gossip protocol for detection

Summary & Conclusions

- MPI/FT = MPI/Pro + one or more availability enhancements
- Fault-tolerance concerns lead to new MPI implementations
- Support for simplex, parallel nMR, and/or mixed mode
- nMR is not scalable
- Both time-based nMR and CRC (the choice depends upon message size and the MPI implementation) - can do now
- Self-checking threads - can do now
- Coordinator (execution models) - can do very soon
- Gossip for detection - can do, need to integrate
- Shifted APIs/callbacks - easy to do; will people use them?
- Early results with CRC vs. nMR over a TCP/IP cluster shown

Related Work

- G. Stellner (CoCheck, 1996): checkpointing that works with MPI and Condor
- M. Hayden (The Ensemble System, 1997): next-generation Horus communication toolkit
- Evripidou et al. (A Portable Fault Tolerant Scheme for MPI, 1998): redundant-processes approach to masking failed nodes
- A. Agbaria and R. Friedman (Starfish, 1999): event bus; works in a specialized language; related to Ensemble
- G.F. Fagg and J.J. Dongarra (FT-MPI, 2000): growing/shrinking communicators in response to node failures, memory-based checkpointing, reversing calculation?
- G. Bosilca et al. (MPICH-V, 2002; new, to be presented at SC2002): Grid-based modifications to MPICH for "volatile nodes"; automated checkpoint, rollback, and message logging
- Also, a substantial literature related to ISIS/HORUS (Dr. Birman et al. at Cornell) that is interesting for distributed computing