App. Specific FT
An overview of fault-tolerant techniques for HPC
Thomas Herault1 & Yves Robert1,2
1 – University of Tennessee Knoxville
2 – ENS Lyon & Institut Universitaire de France
[email protected] | [email protected]
http://graal.ens-lyon.fr/~yrobert/sc13tutorial.pdf
SC’2013 Tutorial
[email protected] — [email protected] Fault-tolerance for HPC 1/ 56
Thanks
INRIA & ENS Lyon
Anne Benoit
Frederic Vivien
PhD students (Guillaume Aupy, Dounia Zaidouni)
UT Knoxville
George Bosilca
Aurelien Bouteiller
Jack Dongarra
Others
Franck Cappello, Argonne and UIUC-Inria joint lab
Henri Casanova, Univ. Hawai‘i
Amina Guermouche, UIUC-Inria joint lab
Outline
1 Application-specific fault-tolerance techniques (45mn)
- Fault-Tolerant Middleware
- Bags of tasks
- Iterative algorithms and fixed-point convergence
- ABFT for Linear Algebra applications
- Composite approach: ABFT & Checkpointing
Fault Tolerance Software Stack

[Figure: layered software stack — Application; Lib1 / Lib2; Comm. Middleware (MPI); OS; Network; with Runtime Helpers alongside. Annotations map fault-tolerance responsibilities onto the layers: network transient failures (incl. message corruption) fault tolerance; automatic permanent-crash fault tolerance; application-based permanent-crash fault tolerance; permanent-crash detection.]
Motivation

- Generality can prevent efficiency
- Specific solutions exploit more capability, and have more opportunity to extract efficiency
- Naturally Fault-Tolerant Applications
HPC – MPI

Most popular middleware for multi-node programming in HPC: the Message Passing Interface (+ OpenMP + pthreads + ...)

Fault Tolerance in MPI:

"[...] it is the job of the implementor of the MPI subsystem to insulate the user from this unreliability, or to reflect unrecoverable errors as failures. Whenever possible, such failures will be reflected as errors in the relevant communication call. Similarly, MPI itself provides no mechanisms for handling processor failures."

– MPI Standard 3.0, p. 20, l. 36:39
"This document does not specify the state of a computation after an erroneous MPI call has occurred."

– MPI Standard 3.0, p. 21, l. 24:25
MPI Implementations

Open MPI (http://www.open-mpi.org)
- On failure detection, the runtime system kills all processes
- trunk: the error is never reported to the MPI processes
- ft-branch: the error is reported, MPI might be partly usable

MPICH (http://www.mcs.anl.gov/mpi/mpich/)
- Default: on failure detection, the runtime kills all processes; can be de-activated by a runtime switch
- Errors might be reported to MPI processes in that case; MPI might be partly usable
FT Middleware in HPC

- Not MPI: Sockets, PVM... CCI? (http://www.olcf.ornl.gov/center-projects/common-communication-interface/) UCCS?
- FT-MPI: http://icl.cs.utk.edu/harness/, 2003
- MPI-Next-FT proposal (Open MPI, MPICH): ULFM — User-Level Failure Mitigation (http://fault-tolerance.org/ulfm/)
- Checkpoint on Failure: the rejuvenation in HPC
MPI-Next-FT proposal: ULFM

Goal: Resume Communication Capability for MPI (and nothing more)

- Failure Reporting
- Failure notification propagation / Distributed State reconciliation
- Recovery
- Termination

⇒ In the past, these operations have often been merged, which incurs high failure-free overheads. ULFM splits these steps and gives control to the user.
MPI-Next-FT proposal: ULFM

Goal: Resume Communication Capability for MPI (and nothing more)

- Error reporting indicates the impossibility to carry out an operation
- State of MPI is unchanged for operations that can continue (i.e., if they do not involve a dead process)
- Errors are non-uniformly returned (otherwise, the synchronizing semantics would be altered drastically, with a high performance impact)

New APIs
- REVOKE resolves a non-uniform error status
- SHRINK rebuilds error-free communicators
- AGREE allows to quit a communication pattern knowing it is fully complete
MPI-Next-FT proposal: ULFM

Error Reporting: errors are visible only for operations that cannot complete

- Operations that cannot complete return ERR_PROC_FAILED, or ERR_PENDING if appropriate
- State of MPI objects is unchanged (communicators, etc.); repeating the same operation has the same outcome
- Operations that can be completed return MPI_SUCCESS; point-to-point operations between non-failed ranks can continue

[Figure: timeline of sends and receives across ranks; after a process failure, only the operations involving the failed rank report errors, while the others proceed.]
MPI-Next-FT proposal: ULFM

Inconsistent Global State and Resolution

- Operations that cannot complete return ERR_PROC_FAILED, or ERR_PENDING if appropriate
- Operations that can be completed return MPI_SUCCESS
- Local semantics are respected (buffer content is defined); this does not indicate success at other ranks!
- The new constructs MPI_Comm_revoke / MPI_Comm_shrink are a base to resolve inconsistencies introduced by failures

[Figure: a Bcast interrupted by a process failure is revoked, the communicator is shrunk, and the Bcast resumes on the repaired communicator.]
MPI-Next-FT proposal: ULFM

Resilience Extensions for MPI: ULFM. ULFM provides targeted interfaces to empower recovery strategies with adequate options to restore communication capabilities and global consistency, at the necessary levels only.

CONTINUE ACROSS ERRORS — In ULFM, failures do not alter the state of MPI communicators. Point-to-point operations can continue undisturbed between non-faulty processes. ULFM imposes no recovery cost on simple communication patterns that can proceed despite failures.

GROUP EXCEPTIONS — Consistent reporting of failures would add an unacceptable performance penalty. In ULFM, errors are raised only at ranks where an operation is disrupted; other ranks may still complete their operations. A process can use MPI_[Comm,Win,File]_revoke to propagate an error notification to the entire group, and could, for example, interrupt other ranks to join a coordinated recovery.

COLLECTIVE OPERATIONS — Allowing collective operations to operate on damaged MPI objects (communicators, RMA windows, or files) would incur unacceptable overhead. The MPI_Comm_shrink routine builds a replacement communicator, excluding failed processes, which can be used to resume collective communications, spawn replacement processes, and rebuild RMA windows and files.

[Figure: master/worker — after Send(W1,T1), the failure of W1 is detected on Recv(ANY) and T1 is resubmitted to W2. In a second scenario, a failed Recv(P1) leads P2 to call Revoke; all ranks see the communicator revoked and enter recovery. In a third, a Bcast interrupted by a failure is followed by Shrink and a Bcast on the new communicator.]

OPEN MPI ULFM IMPLEMENTATION PERFORMANCE

[Figure: ULFM fault-tolerant MPI performance with failures — IMB ping-pong bandwidth (Gbit/s) and latency (us) between ranks 0 and 1 (IB20G), Open MPI vs. FT Open MPI with a failure at rank 3, message sizes 1 B to 4 MB.] The failure of rank 3 is detected and managed by rank 2 during the 512-byte message test. The connectivity and bandwidth between rank 0 and rank 1 are unaffected by failure handling activities at rank 2.

[Figure: Sequoia AMG performance with fault tolerance — difference in running time within ±1% from 8 to 512 processes.] Sequoia AMG is an unstructured physics mesh application with a complex communication pattern that employs both point-to-point and collective operations. Its failure-free performance is unchanged whether it is deployed with ULFM or normal Open MPI.

Open MPI – ULFM support
- Branch of Open MPI (www.open-mpi.org)
- Maintained on bitbucket: https://bitbucket.org/icldistcomp/ulfm
Master/Worker — Example

[Figure: master dispatches tasks a–e to Worker0–Worker2; task b is resubmitted to another worker after a failure.]

Fault-tolerant master loop (sketch):

MPI_Irecv_init(comm, ANY_SOURCE, work_done)
while( more_work && workers ) {
    submit_work(worker[i++ % workers]);
    rc = MPI_Test(work_done);
    if( MPI_SUCCESS != rc ) {
        MPI_Comm_failure_ack(comm);
        MPI_Comm_failure_get_acked(comm, i);
        worker[i] = worker[workers--];
        resubmit_work(worker[i], i);
    }
}

Worker:

while(1) {
    MPI_Recv( master, &work );
    if( work == STOP_CMD )
        break;
    process_work(work, &result);
    MPI_Send( master, result );
}
Master/Worker

Master:

for(i = 0; i < active_workers; i++) {
    new_work = select_work();
    MPI_Send(i, new_work);
}
while( active_workers > 0 ) {
    MPI_Wait( MPI_ANY_SOURCE, &worker );
    MPI_Recv( worker, &work );
    work_completed(work);
    if( work_tocomplete() == 0 ) break;
    new_work = select_work();
    if( new_work ) MPI_Send( worker, new_work );
}
for(i = 0; i < active_workers; i++) {
    MPI_Send(i, STOP_CMD);
}
Fault Tolerant Master
/* Non-FT preamble */
for(i = 0; i < active_workers; i++) {
    new_work = select_work();
    rc = MPI_Send(i, new_work);
    if( MPI_SUCCESS != rc ) MPI_Abort(MPI_COMM_WORLD);
}
/* FT Section */
<...>
/* Non-FT epilogue */
for(i = 0; i < active_workers; i++) {
    rc = MPI_Send(i, STOP_CMD);
    if( MPI_SUCCESS != rc ) MPI_Abort(MPI_COMM_WORLD);
}
while( active_workers > 0 ) { /* FT Section */
    rc = MPI_Wait( MPI_ANY_SOURCE, &worker );
    switch( rc ) {
    case MPI_SUCCESS: /* Received a result */
        break;
    case MPI_ERR_PENDING:
    case MPI_ERR_PROC_FAILED: /* Worker died */
        <...>
        continue;
    default:
        /* Unknown error, not related to failure */
        MPI_Abort(MPI_COMM_WORLD);
    }
    <...>
case MPI_ERR_PENDING:
case MPI_ERR_PROC_FAILED:
    /* A worker died */
    MPI_Comm_failure_ack(comm);
    MPI_Comm_failure_get_acked(comm, &group);
    MPI_Group_difference(group, failed, &newfailed);
    MPI_Group_size(newfailed, &ns);
    active_workers -= ns;
    /* Iterate on newfailed to mark the work
     * as not submitted */
    failed = group;
    continue;
rc = MPI_Recv( worker, &work );
switch( rc ) {
    /* Code similar to the MPI_Wait code */
    <...>
}
work_completed(work);
if( work_tocomplete() == 0 ) break;
new_work = select_work();
if( new_work ) {
    rc = MPI_Send( worker, new_work );
    switch( rc ) {
        /* Code similar to the MPI_Wait code */
        /* Re-submit the work somewhere */
        <...>
    }
}
} /* End of while( active_workers > 0 ) */
MPI_Group_difference(comm, failed, &living);
/* Iterate on living */
for(i = 0; i < active_workers; i++) {
    MPI_Send(rank_of(comm, living, i), STOP_CMD);
}
Iterative Algorithm
while( gnorm > epsilon ) {
    iterate();
    compute_norm(&lnorm);
    rc = MPI_Allreduce( &lnorm, &gnorm, 1,
                        MPI_DOUBLE, MPI_MAX, comm);
    if( (MPI_ERR_PROC_FAILED == rc) ||
        (MPI_ERR_COMM_REVOKED == rc) ||
        (gnorm <= epsilon) ) {
        if( MPI_ERR_PROC_FAILED == rc )
            MPI_Comm_revoke(comm);
        allsucceeded = (rc == MPI_SUCCESS);
        MPI_Comm_agree(comm, &allsucceeded);
        if( !allsucceeded ) {
            MPI_Comm_revoke(comm);
            MPI_Comm_shrink(comm, &comm2);
            MPI_Comm_free(comm);
            comm = comm2;
            gnorm = epsilon + 1.0;
        }
    }
}
Example: block LU/QR factorization

[Figure: A transformed into its L and U factors.]

- Solve A · x = b (hard)
- Transform A into an LU factorization
- Solve L · y = b, then U · x = y
[Figure: one block LU step — GETF2 factorizes a column block, TRSM updates the row block, GEMM updates the trailing matrix.]
Example: block LU/QR factorization

[Figure: 2D block-cyclic distribution over a 2 × 3 process grid; the blocks owned by rank 2 are scattered over the whole matrix.]

- 2D Block Cyclic Distribution (here 2 × 3)
- Failure of rank 2: a single failure ⇒ many data lost
Algorithm Based Fault Tolerant LU decomposition

[Figure: M × N matrix on a P × Q process grid (block size mb × nb), extended with checksum block columns of total width < 2N/Q + nb.]

- Checksum: invertible operation on the data of the row / column
- Checksum blocks are doubled, to allow recovery when data and checksum are lost together
[Figure: checksum columns A, B appended to the 2D block-cyclic layout, held by dedicated processes.]

- Checksum replication can be avoided by dedicating computing resources to checksum storage
[Figure: GETF2 / TRSM / GEMM applied to the matrix extended with checksum columns.]

- Idea of ABFT: applying the operation on data and checksum preserves the checksum properties
[Figure: trailing-matrix update applied to data and checksum blocks.]

- For the part of the data that is not updated this way, the checksum must be re-calculated
[Figure: checksum updates grouped over q block columns.]

- To avoid slowing down all processors and the panel operation, group checksum updates every q block columns
[Figure: grouped checksum re-computation step.]

- Then, update the missing coverage. Keep the checkpoint block column to cover failures during that time
[Figure: failed data blocks rebuilt from the checksum columns.]

In case of failure, conclude the operation, then:

    Missing Data = Checksum − Sum(Existing Data)
[Figure: failed checksum blocks rebuilt from the data.]

In case of failure, conclude the operation, then:

    Missing Checksum = Sum(Existing Data)
ABFT LU decomposition: implementation

MPI Implementation

- PBLAS-based: need to provide a "Fault-Aware" version of the library
- Cannot enter the recovery state at any point in time: need to complete ongoing operations despite failures
- Recovery starts by defining the position of each process in the factorization and bringing them all into a consistent state (the checksum property holds)
- Need to test the return code of each and every MPI-related call
ABFT LU decomposition: performance

As supercomputers grow ever larger in scale, the Mean Time to Failure becomes shorter and shorter, making the complete and successful execution of complex applications more and more difficult. FT-LA delivers a new approach, utilizing Algorithm-Based Fault Tolerance (ABFT), to help factorization algorithms survive fail-stop failures. The FT-LA software package extends ScaLAPACK with ABFT routines and, in sharp contrast with legacy checkpoint-based approaches, ABFT does not incur I/O overhead, and promises a much more scalable protection scheme.

ABFT, THE IDEA
- Cost of ABFT comes only from extra flops (to update checksums) and extra storage
- Cost decreases with machine scale (divided by Q when using P×Q processes)

PROTECTION
- Matrix protected by block row checksum; the algorithm updates both the trailing matrix AND the checksums
- Computation: flops for the checksum update
- Memory: the matrix is extended with 2F columns every q columns (process grid p × q; F = simultaneous failures tolerated)
- Usually F << q; overheads in F/q — the protection cost is inversely proportional to machine scale! Protection against 2 faults on 192×192 processes ⇒ 1% overhead

RECOVERY
- Missing blocks reconstructed by inverting the checksum operation

FUNCTIONALITY COVERAGE
- Linear Systems of Equations; Least Squares
- Cholesky, LU, QR (with protection of the upper and lower factors)

FEATURES
- Covering four precisions: double complex, single complex, double real, single real (ZCDS)
- Deploys on the MPI FT draft (ULFM), or with "Checkpoint-on-Failure"
- Allows toleration of permanent crashes

WORK IN PROGRESS
- Hessenberg Reduction, Soft (silent) Errors

FIND OUT MORE AT http://icl.cs.utk.edu/ft-la

PERFORMANCE ON KRAKEN
[Figure: ScaLAPACK PDGETRF vs. FT-PDGETRF (no error / with 1 recovery) — performance (Tflop/s) and relative overhead (%) on P×Q grids from 6×6 (N=20k) to 192×192 (N=640k). Open MPI with ULFM; Kraken supercomputer.]
ABFT LU decomposition: implementation

Checkpoint on Failure – MPI Implementation

- FT-MPI / MPI-Next FT: not easily available on large machines
- Checkpoint on Failure = workaround

[Figure: ABFT recovery combined with checkpoint-on-failure.]
ABFT QR decomposition: performance

Fig. 2. ABFT QR and one CoF recovery on Kraken (Lustre). [Performance (Tflop/s) vs. matrix size N, 20k–100k: ScaLAPACK QR, CoF-QR (w/o failure), CoF-QR (w/1 failure).]

Fig. 3. ABFT QR and one CoF recovery on Dancer (local SSD). [Performance (Gflop/s) vs. matrix size N, 10k–50k: same three configurations.]

Fig. 4. Time breakdown of one CoF recovery on Dancer (local SSD). [Application time share (%) of load checkpoint, dump checkpoint, and ABFT recovery, N = 20k–50k.]

5.3 Checkpoint-on-Failure QR Performance

Supercomputer Performance: Figure 2 presents the performance on the Kraken supercomputer. The process grid is 24×24 and the block size is 100. The CoF-QR (no failure) curve presents the performance of the CoF QR implementation in a fault-free execution; it is noteworthy that, when there are no failures, the performance is exactly identical to the performance of the unmodified FT-QR implementation. The CoF-QR (with failure) curves present the performance when a failure is injected after the first step of the PDLARFB kernel. The performance of the non-fault-tolerant ScaLAPACK QR is also presented for reference.

Without failures, the performance overhead compared to the regular ScaLAPACK is caused by the extra computation to maintain the checksums inherent to the ABFT algorithm [12]; this extra computation is unchanged between CoF-QR and FT-QR. Only on runs where a failure happened do the CoF protocols undergo the supplementary overhead of storing and reloading checkpoints. However, the performance of CoF-QR remains very close to the no-failure case. For instance, at matrix size N=100,000, CoF-QR still achieves 2.86 Tflop/s after recovering from a failure, which is 90% of the performance of the non-fault-tolerant ScaLAPACK QR. This demonstrates that the CoF protocol enables efficient, practical recovery schemes on supercomputers.

Impact of Local Checkpoint Storage: Figure 3 presents the performance of the CoF-QR implementation on the Dancer cluster with an 8×16 process grid. Although a smaller test platform, the Dancer cluster features local storage on nodes and a variety of performance analysis tools unavailable on Kraken. As expected (see [12]), the ABFT method has a higher relative cost on this smaller machine. Compared to the Kraken platform, the relative cost of CoF failure recovery is smaller on Dancer. The CoF protocol incurs disk accesses to store and load checkpoints when a failure hits, hence the recovery overhead depends on I/O performance. By breaking down the relative cost of each recovery step in CoF, Figure 4 shows that checkpoint saving and loading only take a small percentage of the total run-time, thanks to the availability of solid state disks on every node. Since checkpoint reloading immediately follows checkpointing, the OS cache satisfies most disk accesses, resulting in high I/O performance. For matrices larger than

Checkpoint on Failure – MPI Performance. Open MPI; Kraken supercomputer.
Fault Tolerance Techniques

General Techniques
- Replication
- Rollback Recovery: Coordinated Checkpointing; Uncoordinated Checkpointing & Message Logging; Hierarchical Checkpointing

Application-Specific Techniques
- Algorithm Based Fault Tolerance (ABFT)
- Iterative Convergence
- Approximated Computation
App. Specific FT
Application
Typical Application
f o r ( an insanenumber ) {/∗ E x t r a c t data from∗ s i m u l a t i o n , f i l l up∗ m a t r i x ∗/
sim2mat ( ) ;
/∗ F a c t o r i z e matr ix ,∗ S o l v e ∗/
d g e q r f ( ) ;d s o l v e ( ) ;
/∗ Update s i m u l a t i o n∗ w i t h r e s u l t v e c t o r ∗/
vec2s im ( ) ;}
[Diagram: three processes (0-2) alternating between LIBRARY phases (inside the library) and GENERAL phases (in application code)]
Characteristics

(+) A large part of the (total) computation is spent in the factorization/solve

Between LA operations, the application:

(-) uses the resulting vector/matrix in operations that do not preserve the checksums on the data

(-) modifies data not covered by the ABFT algorithms
Goodbye ABFT?!
Problem Statement
How to use fault-tolerant operations(*) within a non-fault-tolerant(**) application?(***)

(*) ABFT, or other application-specific FT
(**) Or within an application that does not have the same kind of FT
(***) And keep the application globally fault tolerant...
ABFT&PeriodicCkpt
ABFT&PeriodicCkpt: no failure
[Diagram: failure-free execution; periodic checkpoints during the GENERAL phase, split forced checkpoints around the LIBRARY phase]
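The failure-free pattern above can be sketched as a small control loop. This is a toy illustration only: every function and event name in it is a hypothetical stand-in, not the API of an actual library.

```python
# Toy sketch of one failure-free ABFT&PeriodicCkpt epoch.
# All helper/event names are hypothetical stand-ins.

def run_epoch(log, general_steps, checkpoint_interval):
    # GENERAL phase: data is only protected by periodic checkpointing.
    for step in range(general_steps):
        log.append("general_compute")
        if (step + 1) % checkpoint_interval == 0:
            log.append("periodic_checkpoint")  # checkpoint the full dataset
    # Entering the LIBRARY phase: a forced (split) checkpoint saves the
    # data that the ABFT factorization will NOT protect.
    log.append("forced_checkpoint_non_library_data")
    # LIBRARY phase: the library data is protected by ABFT checksums,
    # so no periodic checkpoints are taken here.
    log.append("abft_factorization")
    return log

events = run_epoch([], general_steps=4, checkpoint_interval=2)
```

On a failure, the recovery path depends on the phase: a failure inside the ABFT factorization is repaired by ABFT plus a reload of the forced checkpoint (partial rollback), whereas a failure in the GENERAL phase rolls the application back to the last periodic checkpoint.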
ABFT&PeriodicCkpt: failure during Library phase
[Diagram: a failure during the LIBRARY phase triggers a partial rollback recovery of the non-library data plus an ABFT recovery of the library data]
ABFT&PeriodicCkpt: failure during General phase
[Diagram: a failure during the GENERAL phase triggers a full rollback recovery from the last periodic checkpoint]
ABFT&PeriodicCkpt: Optimizations
[Diagram: ABFT&PeriodicCkpt execution timeline]
If the duration of the General phase is too small: don't add checkpoints

If the duration of the Library phase is too small: don't do ABFT recovery; remain in General mode

(this assumes a performance model for the library call)
A few notations
[Diagram: an epoch of duration T_0, split into a General phase of length T_G (with checkpoint period P_G) and a Library phase of length T_L]
Times, Periods

T_0: duration of an epoch (without FT)
T_L = α T_0: time spent in the Library phase
T_G = (1 − α) T_0: time spent in the General phase
P_G: periodic checkpointing period
T_G^ff, T_L^ff: "fault-free" times
t_G^lost, t_L^lost: lost times (recovery overheads)
T_G^final, T_L^final: total times (with faults)
[Diagram: checkpoint costs C, C_L, C_G along the timeline]

Costs

C_L = ρ C: time to take a checkpoint of the Library dataset
C_G = (1 − ρ) C: time to take a checkpoint of the General dataset
R, R_L: time to load a full / General dataset checkpoint
D: down time (time to allocate a new machine / reboot)
Recons_ABFT: time to apply the ABFT recovery
φ: slowdown factor on the Library phase when applying ABFT
General phase, fault-free waste

[Diagram: the GENERAL phase with its periodic checkpoints and the split forced checkpoints]
Without Failures

T_G^ff = T_G + C_L                 if T_G < P_G
T_G^ff = T_G / (P_G − C) × P_G     if T_G ≥ P_G
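This case split can be transcribed directly as a numeric sketch (function and variable names are mine; T_G, C_L, P_G, C follow the slide's notation):

```python
def tff_general(TG, CL, PG, C):
    """Fault-free duration of the General phase, per the slide formula."""
    if TG < PG:
        # Phase shorter than the checkpoint period: only the forced
        # checkpoint C_L at the phase boundary is paid.
        return TG + CL
    # Otherwise, every period of length PG contains PG - C units of
    # useful work, so TG units of work take TG / (PG - C) * PG.
    return TG / (PG - C) * PG
```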
Library phase, fault-free waste

[Diagram: the LIBRARY phase, protected by ABFT, with its forced checkpoint C_L]
Without Failures

T_L^ff = φ × T_L + C_L
General phase, failure overhead

[Diagram: a failure during the GENERAL phase, full rollback recovery from the last checkpoint]
Failure Overhead

t_G^lost = D + R + T_G^ff / 2    if T_G < P_G
t_G^lost = D + R + P_G / 2       if T_G ≥ P_G
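A numeric transcription of the expected per-failure loss (a sketch in the slide's notation; the 1/2 factors reflect that, on average, a failure strikes mid-phase or mid-period):

```python
def t_lost_general(D, R, TG, CL, PG, C):
    """Expected time lost per failure in the General phase (slide formula)."""
    if TG < PG:
        tff = TG + CL           # fault-free phase length, previous slide
        return D + R + tff / 2  # re-execute half the phase on average
    return D + R + PG / 2       # re-execute half a period on average
```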
Library phase, failure overhead

[Diagram: a failure during the LIBRARY phase, partial rollback recovery plus ABFT recovery]
Failure Overhead

t_L^lost = D + R_L + Recons_ABFT
Overall
Time (with overheads) of the Library phase is constant (in P_G):

T_L^final = 1 / (1 − (D + R_L + Recons_ABFT)/μ) × (φ × T_L + C_L)

Time (with overheads) of the General phase admits two cases:

T_G^final = 1 / (1 − (D + R + (T_G + C_L)/2)/μ) × (T_G + C_L)    if T_G < P_G

T_G^final = T_G / ((1 − C/P_G) × (1 − (D + R + P_G/2)/μ))        if T_G ≥ P_G

which is minimal, in the second case, for

P_G = √(2C(μ − D − R))
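The optimal period is a Young-style first-order approximation; a small sketch to evaluate it:

```python
import math

def optimal_period(C, mu, D, R):
    """P_G minimizing T_G^final in the T_G >= P_G case: sqrt(2C(mu - D - R))."""
    return math.sqrt(2 * C * (mu - D - R))

# Example: C = 60 s, mu = 1 day, negligible downtime and recovery.
PG = optimal_period(60, 86400, 0, 0)  # about 3220 s, i.e. roughly 54 minutes
```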
Waste
From the previous formulas, we derive the waste, which is obtained as

Waste = 1 − T_0 / (T_G^final + T_L^final)
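Putting the pieces together, the waste can be evaluated numerically. This sketch hard-codes the T_G ≥ P_G case and uses illustrative parameter values, not measurements; names follow the slides except `recons` (standing for Recons_ABFT):

```python
def waste(T0, alpha, phi, C, rho, PG, mu, D, R, RL, recons):
    """Waste of ABFT&PeriodicCkpt, assuming TG >= PG (long General phase)."""
    TL = alpha * T0          # Library phase length
    TG = (1 - alpha) * T0    # General phase length
    CL = rho * C             # checkpoint cost of the Library dataset
    # Library phase: fault-free time, inflated by the per-failure loss.
    t_final_L = (phi * TL + CL) / (1 - (D + RL + recons) / mu)
    # General phase, TG >= PG case.
    t_final_G = TG / ((1 - C / PG) * (1 - (D + R + PG / 2) / mu))
    return 1 - T0 / (t_final_G + t_final_L)

# Illustrative values: 1-day MTBF, 1-minute checkpoints, mild ABFT slowdown.
w = waste(T0=100000, alpha=0.8, phi=1.03, C=60, rho=0.8,
          PG=3220.0, mu=86400, D=60, R=60, RL=30, recons=60)
```

As expected, the waste shrinks when the MTBF grows, everything else being equal.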
Toward Exascale, and Beyond!
Let's think at scale

Number of components ↗ ⇒ MTBF ↘
Number of components ↗ ⇒ problem size ↗
Problem size ↗ ⇒ time spent in the Library phase ↗

(+) ABFT&PeriodicCkpt should perform better at scale

(-) By how much?
Competitors
FT algorithms compared
PeriodicCkpt: basic periodic checkpointing

Bi-PeriodicCkpt: applies incremental checkpointing techniques to save only the library data during the library phase

ABFT&PeriodicCkpt: the algorithm described above
Weak Scale Scenario #1
Number of components n increases

Memory per component remains constant

Problem size increases in O(√n) (e.g., matrix operations)

μ ∈ O(1/n); μ = 1 day at n = 10^5

C (= R) ∈ O(n); C = 1 minute at n = 10^5

α is constant at 0.8, as is ρ (both the Library and General phases grow in time at the same speed)
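The scaling assumptions of this scenario can be written down explicitly (a sketch; `mu_ref` and `C_ref` are the calibration points stated above):

```python
def mtbf(n, mu_ref=86400.0, n_ref=1e5):
    """Platform MTBF mu in O(1/n), calibrated to 1 day at n = 10^5."""
    return mu_ref * n_ref / n

def ckpt_time(n, C_ref=60.0, n_ref=1e5):
    """Checkpoint (and recovery) time C = R in O(n), 1 minute at n = 10^5."""
    return C_ref * n / n_ref
```

Growing the machine tenfold divides the MTBF by ten and multiplies the checkpoint cost by ten, which is why pure periodic checkpointing degrades with scale in the plots that follow.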
Weak Scale #1
[Plot: number of faults (top) and waste (bottom) vs. number of nodes, 1k to 1M, for PeriodicCkpt, Bi-PeriodicCkpt, and ABFT&PeriodicCkpt]
Weak Scale Scenario #2
Number of components n increases

Memory per component remains constant

Problem size increases in O(√n) (e.g., matrix operations)

μ ∈ O(1/n); μ = 1 day at n = 10^5

C (= R) ∈ O(n); C = 1 minute at n = 10^5

ρ remains constant at 0.8, but the Library phase grows in O(n^3) while the General phase progresses in O(n^2) (α is 0.8 at n = 10^5 nodes)
Weak Scale #2
[Plot: number of faults (top) and waste (bottom) vs. number of nodes, 1k to 1M, for PeriodicCkpt, Bi-PeriodicCkpt, and ABFT&PeriodicCkpt; a second axis shows the ratio of time spent in the ABFT routine]
Weak Scale Scenario #3
Number of components n increases

Memory per component remains constant

Problem size increases in O(√n) (e.g., matrix operations)

μ ∈ O(1/n); μ = 1 day at n = 10^5

C (= R) = 1 minute at n = 10^5 and stays independent of n (O(1))

ρ remains constant at 0.8, but the Library phase grows in O(n^3) while the General phase progresses in O(n^2) (α is 0.8 at n = 10^5 nodes)
Weak Scale #3
[Plot: number of faults (top) and waste (bottom) vs. number of nodes for PeriodicCkpt, Bi-PeriodicCkpt, and ABFT&PeriodicCkpt; α = 0.55 at 1k, 0.8 at 10k, 0.92 at 100k, 0.975 at 1M nodes]