
ERLANGEN REGIONAL COMPUTING CENTER

08.09.2015, 1st International Workshop on Fault Tolerant Systems, IEEE Cluster '15

Building a fault tolerant application using the GASPI communication layer


Motivation

Today, growth in computational capacity comes mainly from extreme levels of hardware parallelism.

On future machines, the mean time to failure (MTTF) is expected to be on the order of minutes to hours.

Without a fault-tolerant environment, valuable data is at risk.

The lack of a well-defined fault-tolerant environment is the first major challenge in developing fault-tolerant applications.



Automatic Fault Tolerance Application: Approaching the problem

1. Failure detection:

i. Who detects the failure?

ii. Failure information propagation

iii. Consensus about failed processes

2. Processes and communicator recovery

i. Shrink

ii. Spawn

iii. Spare

3. Lost data recovery

…



Failure detection approaches:

1. Ping-based all-to-all

• Within each iteration, the health of all processes is checked.

2. Ping-based neighbor-level

• After a neighbor failure is detected -> check all-to-all health

3. Unsuccessful communication

• After a failure is detected -> check all-to-all health

4. Dedicated failure detection process(es)

• Pings all other processes

• Holds a global view of the health of all processes

• Propagates the failure information to the remaining processes
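To get a feel for the cost of these approaches, it can help to count the pings issued per detection round. The sketch below is illustrative only: the function names are hypothetical, and the fixed number of neighbors per process in the neighbor-level case is an assumption that does not come from the slides.

```python
def pings_all_to_all(num_procs: int) -> int:
    """Every process pings every other process once per round."""
    return num_procs * (num_procs - 1)


def pings_neighbor_level(num_procs: int, neighbors_per_proc: int) -> int:
    """Every process pings only its neighbors (the neighbor count is an assumption)."""
    return num_procs * neighbors_per_proc


def pings_dedicated_detector(num_procs: int) -> int:
    """A single detector process pings all other processes."""
    return num_procs - 1


if __name__ == "__main__":
    n = 256  # example process count
    print("all-to-all:        ", pings_all_to_all(n))
    print("neighbor-level:    ", pings_neighbor_level(n, 2))
    print("dedicated detector:", pings_dedicated_detector(n))
```

Under these assumptions the all-to-all scheme grows quadratically with the process count, while the dedicated detector grows linearly and keeps the pinging work off the worker processes.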



Automatic Fault Tolerance Application: Approaching the problem

1. Failure detection: Fault-detector process

2. Processes and communicator recovery: Spare nodes

3. Lost data recovery: "Neighbor" node-level checkpoint/restart


[Figure: process layout. Rank 0: fault-detector process. Ranks 1-3: spare nodes. Ranks 4-9: worker communicator.]
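The process layout above (one detector, a few spares, the rest workers) can be expressed as a simple mapping from rank to role. This is a minimal sketch: `role_of`, `worker_group`, and the constants are hypothetical names, and the rank ranges simply follow the 10-process example on the slide.

```python
DETECTOR_RANK = 0  # rank 0 acts as the fault detector in the slide's example
NUM_SPARES = 3     # ranks 1..3 are spare (idle) processes in the slide's example


def role_of(rank: int, num_spares: int = NUM_SPARES) -> str:
    """Return the role of a rank: 'detector', 'spare', or 'worker'."""
    if rank == DETECTOR_RANK:
        return "detector"
    if rank <= num_spares:
        return "spare"
    return "worker"


def worker_group(num_procs: int, num_spares: int = NUM_SPARES) -> list:
    """Ranks that would form the initial worker communicator."""
    return [r for r in range(num_procs) if role_of(r, num_spares) == "worker"]


if __name__ == "__main__":
    for rank in range(10):
        print(rank, role_of(rank))
    print("workers:", worker_group(10))
```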


Fault tolerance in GASPI: Introduction (I)

GASPI: developed by Fraunhofer ITWM, Kaiserslautern, Germany

Based on the PGAS programming model

Two memory parts:

• Local: only local to the GASPI process (and its threads)

• Global: Available to other processes for reading and writing.

Enables fault tolerance:

• If a single node fails, the remaining nodes stay up and running.

• Provides a timeout for every communication call. Return values: GASPI_SUCCESS, GASPI_TIMEOUT, GASPI_ERROR
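The three return values suggest a simple handling pattern: repeat a call while it times out, and stop on success or error. The sketch below simulates this in Python and does not call the real GASPI C API. The `Status` enum is a stand-in for the GASPI return codes, and `wait_with_retries`, `make_fake_call`, and the retry limit are hypothetical.

```python
from enum import Enum, auto


class Status(Enum):
    """Stand-in for GASPI_SUCCESS / GASPI_TIMEOUT / GASPI_ERROR (not the real values)."""
    SUCCESS = auto()
    TIMEOUT = auto()
    ERROR = auto()


def wait_with_retries(call, max_retries: int) -> Status:
    """Repeat a timed call while it reports TIMEOUT; return on SUCCESS or ERROR."""
    for _ in range(max_retries):
        status = call()
        if status is not Status.TIMEOUT:
            return status
    return Status.TIMEOUT  # still not completed after max_retries attempts


def make_fake_call(outcomes):
    """Build a fake communication call that yields the given outcomes in order."""
    it = iter(outcomes)
    return lambda: next(it)


if __name__ == "__main__":
    call = make_fake_call([Status.TIMEOUT, Status.TIMEOUT, Status.SUCCESS])
    print(wait_with_retries(call, max_retries=5))
```

Between retries, a process has the opportunity to do other work, such as checking whether a failure has been reported, instead of blocking indefinitely on a dead partner.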



Fault tolerance in GASPI: Introduction (II)

What GASPI provides:

• gaspi_proc_ping(): a process can check the state of any specific process by pinging it.

• The return value of the ping is either 0 (healthy) or 1 (dead).

User side:

• Deleting the old communicator, creating a new communicator, rebuilding the communication structure, and checkpoint/restart are the user's responsibility.



Failure detector (I):

[Figure: rank 0 is the fault-detector process, ranks 1-3 are idle processes, ranks 4-9 form the worker communicator. Labels: gaspi_proc_ping(); return_val = gaspi_wait().]

return_val:

1) GASPI_SUCCESS

2) GASPI_TIMEOUT

3) GASPI_ERROR



Failure detector (II):

[Figure: rank 0 is the failure-detector process; ranks 4-9 form the worker communicator; the idle processes now show only rank 3, with ranks 1 and 2 drawn separately. Label: GASPI_ERROR.]

Failed proc(s) IDs: 6, 7

Rescue proc(s) IDs: 1, 2

The detector process informs every process about the failure details via gaspi_write().

return_val:

1) GASPI_SUCCESS

2) GASPI_TIMEOUT

3) GASPI_ERROR
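The detector's job in this example can be sketched as two steps: scan the workers for failures, then pair each failed rank with an idle rescue rank. The code below is a simulation in Python, not GASPI code. The names `scan` and `plan_recovery` are hypothetical, the ping is a plain callback, and placing each rescue process in the slot of the failed rank it replaces is an assumption.

```python
def scan(ranks, ping):
    """Return the ranks whose ping did not report a healthy process."""
    return [r for r in ranks if not ping(r)]


def plan_recovery(workers, spares, failed):
    """Pair failed ranks with idle spares.

    Returns (rescue map, new worker list, remaining spares).
    """
    if len(failed) > len(spares):
        raise RuntimeError("not enough spare processes for recovery")
    rescue = dict(zip(failed, spares))
    # Assumption: a rescue process takes over the slot of the rank it replaces.
    new_workers = [rescue.get(w, w) for w in workers]
    remaining_spares = spares[len(failed):]
    return rescue, new_workers, remaining_spares


if __name__ == "__main__":
    workers = [4, 5, 6, 7, 8, 9]
    spares = [1, 2, 3]
    dead = {6, 7}  # simulated failures, matching the slide's example
    failed = scan(workers, ping=lambda r: r not in dead)
    rescue, new_workers, idle = plan_recovery(workers, spares, failed)
    print("failed:", failed)
    print("rescue:", rescue)
    print("new workers:", new_workers)
    print("still idle:", idle)
```

In the real application the resulting failed and rescue IDs would be the details written to every remaining process.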



Automatic Fault Tolerance Application

Program flow:



Asynchronous in-memory checkpointing



Benchmarks (I): Test bed

Lanczos algorithm

Checkpoint data structure:

• After startup: every process stores the matrix communication data structure once.

• The two most recent Lanczos vectors are stored at each checkpoint iteration.

• The most recently calculated eigenvalues.

Test cluster: LiMa (RRZE, Erlangen): 500 nodes, Xeon 5650 "Westmere" chips (12 cores + SMT), 2.66 GHz, 24 GB RAM, QDR InfiniBand

Checkpoint data: v_j, v_{j+1}, metadata
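The checkpoint contents listed above can be pictured as a small record that is copied into memory held on behalf of a neighbor while the computation continues. The following is a minimal single-machine sketch using a Python thread, not the GASPI-based implementation from the talk. `LanczosCheckpoint`, `NeighborStore`, and `checkpoint_async` are hypothetical names, and the in-process dictionary merely stands in for a neighbor node's memory.

```python
import copy
import threading
from dataclasses import dataclass


@dataclass
class LanczosCheckpoint:
    iteration: int       # metadata: iteration at which the checkpoint was taken
    v_j: list            # Lanczos vector v_j
    v_j_plus_1: list     # Lanczos vector v_{j+1}
    eigenvalues: list    # most recently calculated eigenvalues


class NeighborStore:
    """Stand-in for a neighbor's memory holding the latest checkpoint per rank."""

    def __init__(self):
        self._slots = {}
        self._lock = threading.Lock()

    def put(self, rank, ckpt):
        with self._lock:
            self._slots[rank] = ckpt

    def get(self, rank):
        with self._lock:
            return self._slots[rank]


def checkpoint_async(store, rank, ckpt):
    """Snapshot the data now, then transfer it in the background."""
    # The copy is taken synchronously so that later updates to the vectors
    # cannot leak into the checkpoint; only the transfer is asynchronous.
    snapshot = copy.deepcopy(ckpt)
    thread = threading.Thread(target=store.put, args=(rank, snapshot))
    thread.start()
    return thread


if __name__ == "__main__":
    store = NeighborStore()
    v = [1.0, 2.0]
    pending = checkpoint_async(store, 4, LanczosCheckpoint(500, v, [3.0, 4.0], [0.5]))
    v[0] = 99.0      # computation continues and overwrites the vector
    pending.join()   # wait for the transfer before relying on the checkpoint
    print(store.get(4))
```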


Benchmark (II):

Average ping time per process ~ 5-6 µs

Failure-detector process: weak scaling of ping scan, failure detection, and acknowledgement time.



Benchmarks (III):

[Figure: runtime timeline. Labels: "64s"; "Failure detection + re-init + redo-work"; "Computation" (twice).]

Num. of nodes = 256, threads-per-process = 12

Failure detection + acknowledgement + re-init = 11 sec.

# iters. = 3500

Chpt. freq. = 500


Remarks:

Worker processes remain undisturbed in a failure-free application run.

Overhead occurs only in case of worker failure(s).

Redo-work after failure recovery depends on the checkpoint frequency.
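The amount of redo-work follows from how far the failure lies past the last checkpoint. A small worked example, assuming checkpoints are taken at every multiple of the checkpoint frequency; the function names and the failure iteration used below are hypothetical.

```python
def last_checkpoint(failure_iter: int, checkpoint_freq: int) -> int:
    """Iteration of the most recent checkpoint at or before the failure."""
    return (failure_iter // checkpoint_freq) * checkpoint_freq


def redo_iterations(failure_iter: int, checkpoint_freq: int) -> int:
    """Iterations that must be recomputed after restoring the last checkpoint."""
    return failure_iter - last_checkpoint(failure_iter, checkpoint_freq)


if __name__ == "__main__":
    freq = 500           # checkpoint frequency from the benchmark slide
    failure_iter = 1750  # hypothetical failure point within the 3500 iterations
    print("restart from iteration:", last_checkpoint(failure_iter, freq))
    print("iterations to redo:", redo_iterations(failure_iter, freq))
```

With a checkpoint frequency of 500, at most 499 iterations have to be repeated; checkpointing more often shrinks this bound at the price of more checkpoint traffic.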



Outlook:

Related work:

FT communication:

› MPICH-V

› User-Level Failure Mitigation - MPI (ULFM)

› Fault Tolerant Messaging Interface (FMI)

Node-level checkpoint/restart:

› Fault Tolerance Interface (FTI)

› Scalable Checkpoint/Restart (SCR)

Future work:

› Having multiple failure detector processes.

› Adding redundancy for failure detector processes.

› Comparative study: ULFM, SCR


Thank you! Questions?


http://blogs.fau.de/essex/

https://bitbucket.org/essex/ghost