ERLANGEN REGIONAL COMPUTING CENTER
08.09.2015, 1st International Workshop on Fault Tolerant Systems, IEEE Cluster '15
Building a fault tolerant application using the GASPI communication layer
Motivation
Today, growth in computational capacity comes mainly from extreme levels of hardware parallelism.
On future machines, the mean time to failure (MTTF) is expected to be on the order of minutes to hours.
Without a fault tolerant environment, precious data is at risk.
The lack of a well-defined fault tolerant environment is the first big challenge in developing fault tolerant applications.
Automatic Fault Tolerance Application: Approaching the problem
1. Failure detection:
i. Who detects the failure?
ii. Failure information propagation
iii. Consensus about failed processes
2. Processes and communicator recovery
i. Shrink
ii. Spawn
iii. Spare
3. Lost data recovery
…
Failure detection approaches:
1. Ping based all-to-all
• Within each iteration, the health of all processes is checked.
2. Ping based neighbor-level
• After neighbor failure detection -> check all-to-all health
3. Unsuccessful communication
• After failure detection -> check all-to-all health
4. Dedicated failure detection process(es)
• Pings all other processes
• Global view of process health
• Propagates the failure info to the remaining processes
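Approach 2 can be illustrated with a small simulation. This is only a sketch: `ping`, `all_to_all_check`, and `neighbor_level_detect` are hypothetical names, the ring ordering of neighbors is an assumption, and the set `dead` stands in for real process failures. No GASPI calls are made.

```python
def ping(rank, dead):
    """Hypothetical stand-in for a real ping: True if `rank` answers."""
    return rank not in dead


def all_to_all_check(ranks, dead):
    """Full health scan: return every rank that does not answer."""
    return sorted(r for r in ranks if not ping(r, dead))


def neighbor_level_detect(ranks, dead):
    """Each live process pings its right-hand neighbor in a ring.

    The first failed neighbor ping triggers the all-to-all health check.
    """
    n = len(ranks)
    for i, r in enumerate(ranks):
        if r in dead:
            continue  # a dead process cannot ping anyone
        neighbor = ranks[(i + 1) % n]
        if not ping(neighbor, dead):
            return all_to_all_check(ranks, dead)
    return []  # no neighbor reported a failure


workers = [4, 5, 6, 7, 8, 9]
failed = neighbor_level_detect(workers, dead={6, 7})
```

In the failure-free case only the cheap neighbor pings run; the full scan is paid for only after a neighbor stops answering.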
Automatic Fault Tolerance Application: Approaching the problem
1. Failure detection: fault-detector process
2. Process and communicator recovery: spare nodes
3. Lost data recovery: "neighbor" node-level checkpoint/restart
[Figure: process layout. Rank 0: fault-detector process; ranks 1-3: spare nodes; ranks 4-9: worker communicator.]
Fault tolerance in GASPI: Introduction (I)
GASPI:
• Developed by Fraunhofer ITWM, Kaiserslautern, Germany
• Based on the PGAS programming model
Two memory parts:
• Local: visible only to the GASPI process (and its threads)
• Global: available to other processes for reading and writing
Enables fault tolerance:
• If a single node fails, the remaining nodes stay up and running
• Provides a timeout for every communication call. Return values: GASPI_SUCCESS, GASPI_TIMEOUT, GASPI_ERROR
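One way a caller might react to the three return values is sketched below. The strings only mirror the GASPI constant names; `call_with_retries` and the retry limit are hypothetical, and `op` is a stand-in for a communication call with a timeout, not a real GASPI function.

```python
SUCCESS, TIMEOUT, ERROR = "GASPI_SUCCESS", "GASPI_TIMEOUT", "GASPI_ERROR"


def call_with_retries(op, max_retries=3):
    """Repeat op() while it times out; return (final status, attempts used).

    A timeout is treated as "not finished yet", an error as final.
    """
    for attempt in range(1, max_retries + 1):
        status = op()
        if status != TIMEOUT:
            return status, attempt
    return TIMEOUT, max_retries


# Simulated call that times out twice and then succeeds.
replies = iter([TIMEOUT, TIMEOUT, SUCCESS])
status, attempts = call_with_retries(lambda: next(replies))
```

Because a timed-out call returns control to the application, the caller can do other work (for example, check for a failure notification) between retries instead of blocking forever.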
Fault tolerance in GASPI: Introduction (II)
What GASPI provides:
• gaspi_proc_ping(): a process can check the state of any specific process by pinging it.
• The return value of the ping is either 0 or 1 (healthy or dead).
User side:
• Deleting the old communicator, creating the new communicator, setting up the new communication structure, and checkpoint/restart are the user's responsibility.
Failure detector (I):
[Figure: the fault detector process (rank 0) pings the idle processes (ranks 1-3) and the worker communicator (ranks 4-9): gaspi_proc_ping(), then return_val = gaspi_wait(). return_val is one of: 1) GASPI_SUCCESS, 2) GASPI_TIMEOUT, 3) GASPI_ERROR.]
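A single pass of the detector over all other processes could look roughly like this. It is a simulation, not GASPI code: `scan` and `ping_status` are hypothetical, the strings only mirror the GASPI return-value names, and treating GASPI_TIMEOUT as "suspect, check again in the next pass" is an assumption rather than something the slides specify.

```python
SUCCESS, TIMEOUT, ERROR = "GASPI_SUCCESS", "GASPI_TIMEOUT", "GASPI_ERROR"


def scan(ranks, ping_status):
    """One detector pass: classify every rank by the status of its ping.

    ping_status is a hypothetical callable mapping rank -> status string.
    Returns (failed, suspect): ranks that returned ERROR and TIMEOUT.
    """
    failed, suspect = [], []
    for rank in ranks:
        status = ping_status(rank)
        if status == ERROR:
            failed.append(rank)
        elif status == TIMEOUT:
            suspect.append(rank)  # assumption: re-examined in the next pass
    return failed, suspect


# Simulated cluster state: ranks 6 and 7 are dead, rank 8 is slow to answer.
state = {6: ERROR, 7: ERROR, 8: TIMEOUT}
failed, suspect = scan(range(1, 10), lambda r: state.get(r, SUCCESS))
```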
Failure detector (II):
[Figure: the failure detector process (rank 0) pings all processes: gaspi_proc_ping(), then return_val = gaspi_wait(), where return_val is one of: 1) GASPI_SUCCESS, 2) GASPI_TIMEOUT, 3) GASPI_ERROR. On GASPI_ERROR it records the failed process IDs (6, 7) and the rescue process IDs (1, 2) taken from the idle processes; process 3 stays idle.]
The detector process informs every process about the failure details via gaspi_write().
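The bookkeeping behind the failed and rescue process IDs can be sketched as follows. The function name `recover` is hypothetical, and letting each rescue process take over the position of the process it replaces is an assumption. The rank numbers are those of the figure.

```python
def recover(workers, idle, failed):
    """Pair each failed worker with an idle (spare) process.

    Returns the new worker list, the idle processes still available,
    and the failed -> rescue mapping the detector would broadcast.
    """
    if len(failed) > len(idle):
        raise RuntimeError("more failed processes than idle processes")
    rescue = idle[:len(failed)]
    replacement = dict(zip(failed, rescue))
    # Assumption: a rescue process takes the slot of the process it replaces.
    new_workers = [replacement.get(rank, rank) for rank in workers]
    remaining_idle = idle[len(failed):]
    return new_workers, remaining_idle, replacement


new_workers, remaining_idle, replacement = recover(
    workers=[4, 5, 6, 7, 8, 9], idle=[1, 2, 3], failed=[6, 7]
)
```

The surviving workers and the rescue processes need the same mapping to rebuild a consistent communicator, which is why the detector sends it to every process.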
Automatic Fault Tolerance Application
Program flow:
Asynchronous in-memory checkpointing
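The idea can be sketched with a background thread that copies a snapshot into a neighbor's memory while the worker keeps computing. Everything here is illustrative: `AsyncCheckpointer`, `neighbor_of`, and the `stores` dictionary (standing in for remote memory) are hypothetical, and choosing the right-hand neighbor in the worker list is an assumption.

```python
import copy
import threading


def neighbor_of(rank, workers):
    """Right-hand neighbor of `rank` in the worker list (wraps around)."""
    i = workers.index(rank)
    return workers[(i + 1) % len(workers)]


class AsyncCheckpointer:
    def __init__(self, rank, workers, stores):
        self.rank = rank
        self.neighbor = neighbor_of(rank, workers)
        self.stores = stores  # hypothetical: rank -> in-memory checkpoint dict
        self._thread = None

    def checkpoint(self, iteration, state):
        self.wait()  # keep at most one copy in flight
        # Snapshot synchronously so the worker may go on modifying `state`.
        snapshot = copy.deepcopy(state)
        self._thread = threading.Thread(
            target=self._store, args=(iteration, snapshot)
        )
        self._thread.start()

    def _store(self, iteration, snapshot):
        # Runs in the background: place the copy in the neighbor's memory.
        self.stores[self.neighbor][self.rank] = (iteration, snapshot)

    def wait(self):
        if self._thread is not None:
            self._thread.join()
            self._thread = None


workers = [4, 5, 6, 7, 8, 9]
stores = {rank: {} for rank in workers}
cp = AsyncCheckpointer(9, workers, stores)
state = {"v": [1.0, 2.0]}
cp.checkpoint(500, state)
state["v"][0] = 99.0  # computation continues while the copy is written
cp.wait()
```

If a node fails, its checkpoint survives in the memory of the neighbor, from where a rescue process can read it.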
Benchmarks (I): Test bed
Lanczos algorithm:
Checkpoint data structure:
• After startup: every process stores its matrix communication data structure once.
• At each checkpoint iteration: the two most recent Lanczos vectors are stored.
• Recently calculated eigenvalues.
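A possible in-memory layout for the per-iteration part of such a checkpoint is sketched below. The class and field names are hypothetical, and plain Python lists stand in for the distributed vectors.

```python
from dataclasses import dataclass


@dataclass
class LanczosCheckpoint:
    iteration: int      # iteration at which the checkpoint was taken
    v_j: list           # Lanczos vector v_j
    v_j1: list          # Lanczos vector v_{j+1}
    eigenvalues: list   # recently calculated eigenvalues


def take_checkpoint(iteration, v_j, v_j1, eigenvalues):
    """Copy the data so later iterations cannot alter the checkpoint."""
    return LanczosCheckpoint(
        iteration, list(v_j), list(v_j1), list(eigenvalues)
    )


v_j, v_j1, eigs = [1.0, 0.0], [0.0, 1.0], [0.5]
ckpt = take_checkpoint(500, v_j, v_j1, eigs)
v_j[0] = -1.0  # the solver keeps iterating; the checkpoint is unaffected
```

Only two vectors are needed because the Lanczos three-term recurrence uses just the two most recent vectors to compute the next one.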
Test cluster: LiMa (RRZE, Erlangen): 500 nodes, Xeon 5650 "Westmere" chips (12 cores + SMT), 2.66 GHz, 24 GB RAM, QDR InfiniBand
Checkpoint data: v_j, v_{j+1}, metadata
Benchmarks (II):
Average ping time per process: ~5-6 µs
Failure-detector process: weak scaling of ping scan, failure detection, and acknowledgement time.
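As a rough back-of-the-envelope estimate, the per-process ping time gives the duration of one full ping scan. The calculation assumes that pings are issued one after another and that 256 processes take part (one per node); both are assumptions, not measured results.

```python
def scan_time_us(num_processes, ping_us):
    """Time for one sequential ping pass over all processes but the detector."""
    return (num_processes - 1) * ping_us


low = scan_time_us(256, 5.0)   # 255 pings at 5 µs each
high = scan_time_us(256, 6.0)  # 255 pings at 6 µs each
print(f"one scan: {low / 1000:.2f} to {high / 1000:.2f} ms")
```

Under these assumptions a full scan takes on the order of 1.3 to 1.5 ms.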
Benchmarks (III):
[Figure: timeline of a run with a failure. Segment labels: "Computation", "Failure detection + re-init + redo-work", "Computation". Annotated value: 64 s.]
Number of nodes = 256, threads per process = 12
Number of iterations = 3500, checkpoint frequency = 500
Failure detection + acknowledgement + re-init = 11 s
Remarks:
Worker processes remain undisturbed in a failure-free application run.
Overhead arises only in case of worker failure(s).
Redo-work after failure recovery depends on the checkpoint frequency.
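The link between redo-work and checkpoint frequency can be made concrete with a small calculation. It assumes that checkpoints are taken at iterations that are multiples of the checkpoint interval; the failure iteration used below is a made-up example, not a measured value.

```python
def redo_iterations(failure_iter, checkpoint_interval):
    """Iterations to recompute after restarting from the last checkpoint."""
    return failure_iter % checkpoint_interval


def worst_case_redo(checkpoint_interval):
    """Failure just before the next checkpoint would have been taken."""
    return checkpoint_interval - 1


redo = redo_iterations(failure_iter=1730, checkpoint_interval=500)
worst = worst_case_redo(500)
```

With a checkpoint every 500 iterations, a failure at iteration 1730 means restarting from iteration 1500 and repeating 230 iterations. Checkpointing more often reduces this redo-work at the price of more checkpoint traffic.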
Outlook:
Related work:
FT communication:
› MPICH-V
› User-Level Failure Mitigation for MPI (ULFM)
› Fault Tolerant Messaging Interface (FMI)
Node-level checkpoint/restart:
› Fault Tolerance Interface (FTI)
› Scalable Checkpoint/Restart (SCR)
Future work:
› Multiple failure detector processes
› Redundancy for failure detector processes
› Comparative study: ULFM, SCR
Thank you! Questions?
http://blogs.fau.de/essex/
https://bitbucket.org/essex/ghost