
Page 1

App. Specific FT

An overview of fault-tolerant techniques for HPC

Thomas Herault (1) & Yves Robert (1,2)

1 – University of Tennessee Knoxville
2 – ENS Lyon & Institut Universitaire de France

[email protected] | [email protected]

http://graal.ens-lyon.fr/~yrobert/sc13tutorial.pdf

SC’2013 Tutorial

Page 2

Thanks

INRIA & ENS Lyon

Anne Benoit

Frederic Vivien

PhD students (Guillaume Aupy, Dounia Zaidouni)

UT Knoxville

George Bosilca

Aurelien Bouteiller

Jack Dongarra

Others

Franck Cappello, Argonne and UIUC-Inria joint lab

Henri Casanova, Univ. Hawai‘i

Amina Guermouche, UIUC-Inria joint lab

Page 3

Outline

1 Application-specific fault-tolerance techniques (45mn)
  Fault-Tolerant Middleware
  Bags of tasks
  Iterative algorithms and fixed-point convergence
  ABFT for Linear Algebra applications
  Composite approach: ABFT & Checkpointing

Page 5

Fault Tolerance Software Stack

Application

Lib1 Lib2

Comm. Middleware (MPI)

OS

Network

Runtime Helpers

Page 6

Fault Tolerance Software Stack

Application

Lib1 Lib2

Comm. Middleware (MPI)

OS

Network

Runtime Helpers

Network: Transient Failures (inc. msg corruption) Fault Tolerance
Automatic Permanent Crash Fault Tolerance
Application-Based Permanent Crash Fault Tolerance
Permanent Crash Detection

Page 7

Motivation


Generality can prevent Efficiency

Specific solutions exploit more capability and have more opportunity to extract efficiency

Naturally Fault Tolerant Applications

Page 8

Outline

1 Application-specific fault-tolerance techniques (45mn)Fault-Tolerant MiddlewareBags of tasksIterative algorithms and fixed-point convergenceABFT for Linear Algebra applicationsComposite approach: ABFT & Checkpointing

Page 9

HPC – MPI

HPC

Most popular middleware for multi-node programming in HPC: the Message Passing Interface (+ OpenMP + pthreads + ...)

Fault Tolerance in MPI:

[...] it is the job of the implementor of the MPI subsystem to insulate the user from this unreliability, or to reflect unrecoverable errors as failures. Whenever possible, such failures will be reflected as errors in the relevant communication call. Similarly, MPI itself provides no mechanisms for handling processor failures.

– MPI Standard 3.0, p. 20, l. 36:39

Page 10

HPC – MPI

HPC

Most popular middleware for multi-node programming in HPC: the Message Passing Interface (+ OpenMP + pthreads + ...)

Fault Tolerance in MPI:

This document does not specify the state of a computation after an erroneous MPI call has occurred.

– MPI Standard 3.0, p. 21, l. 24:25

Page 11

HPC – MPI

MPI Implementations

Open MPI (http://www.open-mpi.org)

On failure detection, the runtime system kills all processes.
  trunk: the error is never reported to the MPI processes.
  ft-branch: the error is reported; MPI might be partly usable.

MPICH (http://www.mcs.anl.gov/mpi/mpich/)

Default: on failure detection, the runtime kills all processes.
This can be deactivated by a runtime switch; errors might then be reported to the MPI processes, and MPI might be partly usable.

Page 12

FT Middleware in HPC

Not MPI: Sockets, PVM... CCI? (http://www.olcf.ornl.gov/center-projects/common-communication-interface/) UCCS?

FT-MPI: http://icl.cs.utk.edu/harness/, 2003

MPI-Next-FT proposal (Open MPI, MPICH): ULFM

User-Level Failure Mitigation: http://fault-tolerance.org/ulfm/

Checkpoint on Failures: the rejuvenation in HPC

Page 13

MPI-Next-FT proposal: ULFM

Goal

Resume Communication Capability for MPI (and nothing more)

Failure Reporting

Failure notification propagation / distributed state reconciliation

⇒ In the past, these operations have often been merged
⇒ this incurs high failure-free overheads

ULFM splits these steps and gives control to the user

Recovery

Termination

Page 14

MPI-Next-FT proposal: ULFM

Goal

Resume Communication Capability for MPI (and nothing more)

Error reporting indicates the impossibility to carry out an operation

The state of MPI is unchanged for operations that can continue (i.e. if they do not involve a dead process)

Errors are non-uniformly returned

(Otherwise, the synchronizing semantics are altered drastically, with a high performance impact)

New APIs

REVOKE allows resolving non-uniform error status

SHRINK allows rebuilding error-free communicators

AGREE allows quitting a communication pattern knowing it is fully complete

Page 15

MPI-Next-FT proposal: ULFM

Errors are visible only for operations that cannot complete

Error Reporting

Operations that cannot complete return MPI_ERR_PROC_FAILED (or MPI_ERR_PENDING if appropriate)
The state of MPI objects is unchanged (communicators, etc.)
Repeating the same operation has the same outcome

Operations that can be completed return MPI_SUCCESS
Point-to-point operations between non-failed ranks can continue

[Timeline diagram: sends and receives among ranks; a process failure (PF) disrupts only the operations involving the failed rank, while point-to-point traffic between surviving ranks proceeds.]

Page 16

MPI-Next-FT proposal: ULFM

Inconsistent Global State and Resolution

Error Reporting

Operations that can't complete return MPI_ERR_PROC_FAILED (or MPI_ERR_PENDING if appropriate)
Operations that can be completed return MPI_SUCCESS
Local semantics are respected (buffer content is defined); this does not indicate success at other ranks.

New constructs MPI_Comm_revoke / MPI_Comm_shrink are a base to resolve inconsistencies introduced by failure

[Timeline diagram: a Bcast is interrupted by a process failure (PF); the communicator is revoked, shrunk, and the Bcast resumes on the new communicator.]

Page 17

MPI-Next-FT proposal: ULFM

Resilience Extensions for MPI: ULFM

ULFM provides targeted interfaces to empower recovery strategies with adequate options to restore communication capabilities and global consistency, at the necessary levels only.

Sequoia AMG is an unstructured physics mesh application with a complex communication pattern that employs both point-to-point and collective operations. Its failure free performance is unchanged whether it is deployed with ULFM or normal Open MPI.

The failure of rank 3 is detected and managed by rank 2 during the 512 bytes message test. The connectivity and bandwidth between rank 0 and rank 1 are unaffected by failure handling activities at rank 2.

CONTINUE ACROSS ERRORS

In ULFM, failures do not alter the state of MPI communicators. Point-to-point operations can continue undisturbed between non-faulty processes. ULFM imposes no recovery cost on simple communication patterns that can proceed despite failures.

GROUP EXCEPTIONS

Consistent reporting of failures would add an unacceptable performance penalty. In ULFM, errors are raised only at ranks where an operation is disrupted; other ranks may still complete their operations. A process can use MPI_[Comm,Win,File]_revoke to propagate an error notification on the entire group, and could, for example, interrupt other ranks to join a coordinated recovery.

COLLECTIVE OPERATIONS

Allowing collective operations to operate on damaged MPI objects (Communicators, RMA windows or Files) would incur unacceptable overhead. The MPI_Comm_shrink routine builds a replacement communicator, excluding failed processes, which can be used to resume collective communications, spawn replacement processes, and rebuild RMA Windows and Files.

[Diagram: master/worker recovery — the master submits T1 to W1 (Send(W1,T1)); on Recv(ANY) it detects W1's failure and resubmits T1 to W2 (Send(W2,T1)).]

[Diagram: P2's Recv(P1) reports a failure; P2 calls Revoke, so the pending Recv(P1) at the other ranks returns "revoked" and recovery proceeds. A Bcast interrupted by a failure is followed by Shrink, then the Bcast resumes.]

[Figure: ULFM Fault Tolerant MPI performance with failures — IMB ping-pong between ranks 0 and 1 (IB20G); bandwidth (Gbit/s) and latency (us) vs message size (1 B – 4 MB), Open MPI vs FT Open MPI (with a failure at rank 3).]

[Figure: Sequoia AMG performance with fault tolerance — difference in running time (within ±1%) vs number of processes (8 – 512).]

OPEN MPI ULFM IMPLEMENTATION PERFORMANCE


Open MPI - ULFM support

Branch of Open MPI (www.open-mpi.org)

Maintained on Bitbucket: https://bitbucket.org/icldistcomp/ulfm

Page 18

Outline

1 Application-specific fault-tolerance techniques (45mn)
  Fault-Tolerant Middleware
  Bags of tasks
  Iterative algorithms and fixed-point convergence
  ABFT for Linear Algebra applications
  Composite approach: ABFT & Checkpointing

Page 19

Master/Worker

Example: Master-worker

MPI_Irecv_init(comm, ANY_SOURCE, work_done)
while( more_work && workers ) {
    submit_work(worker[i++ % workers]);
    rc = MPI_Test(work_done);
    if( MPI_SUCCESS != rc ) {
        MPI_Comm_failure_ack(comm);
        MPI_Comm_failure_get_acked(comm, &i);
        worker[i] = worker[workers--];
        resubmit_work(worker[i], i);
    }
}

[Diagram: master dispatching tasks a–e to Worker0, Worker1, Worker2; task b is dispatched a second time after a failure.]

Worker

while(1) {
    MPI_Recv( master, &work );
    if( work == STOP_CMD )
        break;
    process_work(work, &result);
    MPI_Send( master, result );
}

Page 20

Master/Worker

Master

for(i = 0; i < active_workers; i++) {
    new_work = select_work();
    MPI_Send(i, new_work);
}
while( active_workers > 0 ) {
    MPI_Wait( MPI_ANY_SOURCE, &worker );
    MPI_Recv( worker, &work );
    work_completed(work);
    if( work_tocomplete() == 0 ) break;
    new_work = select_work();
    if( new_work ) MPI_Send( worker, new_work );
}
for(i = 0; i < active_workers; i++) {
    MPI_Send(i, STOP_CMD);
}

Page 21

FT Master

Fault Tolerant Master

/* Non-FT preamble */
for(i = 0; i < active_workers; i++) {
    new_work = select_work();
    rc = MPI_Send(i, new_work);
    if( MPI_SUCCESS != rc ) MPI_Abort(MPI_COMM_WORLD);
}

/* FT Section */
<...>

/* Non-FT epilogue */
for(i = 0; i < active_workers; i++) {
    rc = MPI_Send(i, STOP_CMD);
    if( MPI_SUCCESS != rc ) MPI_Abort(MPI_COMM_WORLD);
}

Page 22

FT Master

Fault Tolerant Master

while( active_workers > 0 ) { /* FT Section */
    rc = MPI_Wait( MPI_ANY_SOURCE, &worker );
    switch( rc ) {
    case MPI_SUCCESS: /* Received a result */
        break;
    case MPI_ERR_PENDING:
    case MPI_ERR_PROC_FAILED: /* Worker died */
        <...>
        continue;
        break;
    default:
        /* Unknown error, not related to failure */
        MPI_Abort(MPI_COMM_WORLD);
    }
    <...>

Page 23

FT Master

Fault Tolerant Master

case MPI_ERR_PENDING:
case MPI_ERR_PROC_FAILED:
    /* A worker died */
    MPI_Comm_failure_ack(comm);
    MPI_Comm_failure_get_acked(comm, &group);
    MPI_Group_difference(group, failed, &newfailed);
    MPI_Group_size(newfailed, &ns);
    active_workers -= ns;
    /* Iterate on newfailed to mark the work
     * as not submitted */
    failed = group;
    continue;

Page 24

FT Master

Fault Tolerant Master

rc = MPI_Recv( worker, &work );
switch( rc ) {
    /* Code similar to the MPI_Wait code */
    <...>
}
work_completed(work);
if( work_tocomplete() == 0 ) break;
new_work = select_work();

Page 25

FT Master

Fault Tolerant Master

if( new_work ) {
    rc = MPI_Send( worker, new_work );
    switch( rc ) {
        /* Code similar to the MPI_Wait code */
        /* Re-submit the work somewhere */
        <...>
    }
}
} /* End of while( active_workers > 0 ) */

MPI_Group_difference(comm, failed, &living);
/* Iterate on living */
for(i = 0; i < active_workers; i++) {
    MPI_Send(rank_of(comm, living, i), STOP_CMD);
}

Page 26

Outline

1 Application-specific fault-tolerance techniques (45mn)
  Fault-Tolerant Middleware
  Bags of tasks
  Iterative algorithms and fixed-point convergence
  ABFT for Linear Algebra applications
  Composite approach: ABFT & Checkpointing

Page 27

Iterative Algorithm

while( gnorm > epsilon ) {
    iterate();
    compute_norm(&lnorm);
    rc = MPI_Allreduce( &lnorm, &gnorm, 1,
                        MPI_DOUBLE, MPI_MAX, comm );
    if( (MPI_ERR_PROC_FAILED == rc) ||
        (MPI_ERR_COMM_REVOKED == rc) ||
        (gnorm <= epsilon) ) {
        if( MPI_ERR_PROC_FAILED == rc )
            MPI_Comm_revoke(comm);
        allsucceeded = (rc == MPI_SUCCESS);
        MPI_Comm_agree(comm, &allsucceeded);

Page 28

Iterative Algorithm

        if( !allsucceeded ) {
            MPI_Comm_revoke(comm);
            MPI_Comm_shrink(comm, &comm2);
            MPI_Comm_free(&comm);
            comm = comm2;
            gnorm = epsilon + 1.0;
        }
    }
}

Page 29

Outline

1 Application-specific fault-tolerance techniques (45mn)
  Fault-Tolerant Middleware
  Bags of tasks
  Iterative algorithms and fixed-point convergence
  ABFT for Linear Algebra applications
  Composite approach: ABFT & Checkpointing

Page 30

Example: block LU/QR factorization

[Diagram: A is transformed into its L and U factors]

Solve A · x = b (hard)

Transform A into an LU factorization

Solve L · y = b, then U · x = y

Page 31

Example: block LU/QR factorization

[Diagram: one step of block LU — GETF2: factorize a column block; TRSM: update the row block; GEMM: update the trailing matrix]

Solve A · x = b (hard)

Transform A into an LU factorization

Solve L · y = b, then U · x = y

Page 32

Example: block LU/QR factorization

[Diagram: after each step the factored L and U parts grow, and GETF2, TRSM and GEMM are applied to the remaining trailing matrix]

Solve A · x = b (hard)

Transform A into an LU factorization

Solve L · y = b, then U · x = y

Page 33

Example: block LU/QR factorization

[Diagram: matrix blocks laid out over the 2 × 3 process grid; all blocks owned by rank 2 are lost]

Failure of rank 2

2D Block-Cyclic Distribution (here 2 × 3)

A single failure ⇒ many data lost

Page 34

Algorithm Based Fault Tolerant LU decomposition

[Diagram: M × N matrix in mb × nb blocks on a P × Q process grid; checksum block columns (total width < 2N/Q + nb) are appended, and each checksum block is duplicated]

Checksum: invertible operation on the data of the row / column

Checksum blocks are doubled, to allow recovery when data and checksum are lost together

Page 35

Algorithm Based Fault Tolerant LU decomposition

[Diagram: the same layout with checksum block columns A and B held by dedicated processes (width N/Q)]

Checksum: invertible operation on the data of the row / column

Checksum replication can be avoided by dedicating computing resources to checksum storage

Page 36

Algorithm Based Fault Tolerant LU decomposition

[Diagram: GETF2, TRSM and GEMM are applied to the data blocks and to the checksum blocks A and B alike]

Checksum: invertible operation on the data of the row / column

Idea of ABFT: applying the operation on data and checksum preserves the checksum properties

Page 37

Algorithm Based Fault Tolerant LU decomposition

[Diagram: the checksum blocks covering the part of the data not touched by the update are re-computed]

Checksum: invertible operation on the data of the row / column

For the part of the data that is not updated this way, the checksum must be re-calculated

Page 38

Algorithm Based Fault Tolerant LU decomposition

[Diagram: checksum updates grouped over q block columns — GETF2, TRSM, GEMM]

Checksum: invertible operation on the data of the row / column

To avoid slowing down all processors and the panel operation, group checksum updates every q block columns

Page 41

Algorithm Based Fault Tolerant LU decomposition

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

+

Checksum: invertible operation on the data of the row /column

Then, update the missing coverage. Keep the checkpoint block column to cover failures during that time


In case of failure, conclude the operation, then:

Missing Data = Checksum - Sum(Existing Data)
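The recovery rule can be illustrated on a toy block row (a Python sketch; the block contents and layout are made up for illustration, not taken from the actual ScaLAPACK implementation):

```python
# Toy illustration of ABFT checksum recovery: a block row holds data
# blocks plus a checksum block equal to their element-wise sum.
def make_checksum(blocks):
    """Checksum block = element-wise sum of all data blocks."""
    return [sum(vals) for vals in zip(*blocks)]

# Three data blocks (e.g., held by processes 0, 2, 4) and one checksum
# block (held by checksum process A).
blocks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
checksum = make_checksum(blocks)

# A fail-stop failure wipes out block 1.
lost_index = 1
surviving = [b for i, b in enumerate(blocks) if i != lost_index]

# Missing Data = Checksum - Sum(Existing Data)
recovered = [c - sum(vals) for c, vals in zip(checksum, zip(*surviving))]
assert recovered == blocks[lost_index]

# Symmetrically, a lost checksum block is rebuilt from the data:
# Missing Checksum = Sum(Existing Data)
rebuilt = make_checksum(blocks)
assert rebuilt == checksum
```

The checksum is maintained through the factorization updates, so the same subtraction works on the current state of the matrix, not just the initial one.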


If the lost blocks were checksum blocks, rebuild them instead:

Missing Checksum = Sum(Existing Data)


ABFT LU decomposition: implementation

MPI Implementation

PBLAS-based: need to provide a "Fault-Aware" version of the library

Cannot enter the recovery state at any point in time: need to complete ongoing operations despite failures

Recovery starts by defining the position of each process in the factorization and bringing them all into a consistent state (the checksum property holds)

Need to test the return code of each and every MPI-related call
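That last requirement can be sketched as a thin wrapper around each communication call (a hypothetical Python illustration; the `bcast_*` stand-ins and the error code are invented — real code would check the integer returned by each MPI_* call, with an error handler set to return rather than abort):

```python
MPI_SUCCESS = 0

class ProcFailure(Exception):
    """Raised when an MPI-like call reports a failed peer."""

def checked(call, *args):
    """Invoke an MPI-style call and turn a non-success return code into
    an exception, so the algorithm can branch to its recovery path."""
    rc = call(*args)
    if rc != MPI_SUCCESS:
        raise ProcFailure(f"{call.__name__} returned {rc}")
    return rc

# Hypothetical stand-ins for fault-aware communication calls.
def bcast_ok(buf):      return MPI_SUCCESS
def bcast_failed(buf):  return 75   # a made-up "process failed" code

checked(bcast_ok, [1, 2, 3])        # completes normally
try:
    checked(bcast_failed, [1, 2, 3])
except ProcFailure:
    recovered = True                # enter ABFT recovery instead of aborting
assert recovered
```

Wrapping every call this way is what makes it possible to finish the ongoing operation and only then enter the recovery state.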


ABFT LU decomposition: performance

As supercomputers grow ever larger in scale, the Mean Time to Failure becomes shorter and shorter, making the complete and successful execution of complex applications more and more difficult. FT-LA delivers a new approach, utilizing Algorithm-Based Fault Tolerance (ABFT), to help factorization algorithms survive fail-stop failures. The FT-LA software package extends ScaLAPACK with ABFT routines, and in sharp contrast with legacy checkpoint-based approaches, ABFT does not incur I/O overhead, and promises a much more scalable protection scheme.

ABFT THE IDEA

Cost of ABFT comes only from extra flops (to update checksums) and extra storage

Cost decreases with machine scale (divided by Q when using PxQ processes)

PROTECTION

Matrix protected by block row checksum

The algorithm updates both the trailing matrix AND the checksums

RECOVERY

Missing blocks reconstructed by inverting the checksum operation

FUNCTIONALITY COVERAGE

Linear Systems of Equations

Least Squares

Cholesky, LU

QR (with protection of the upper and lower factors)

FEATURES

WORK IN PROGRESS

Covering four precisions: double complex, single complex, double real, single real (ZCDS)

Deploys on MPI FT draft (ULFM), or with “Checkpoint-on-failure”

Tolerates permanent process crashes

Hessenberg Reduction, Soft (silent) Errors

Process grid: p × q; F: number of simultaneous failures tolerated

Protection against 2 faults on 192×192 processes ⇒ ~1% overhead

Usually F ≪ q; overheads scale in F/q

Protection cost is inversely proportional to machine scale!

Computation: extra flops for the checksum updates

Memory: the matrix is extended with 2F columns every q columns

FIND OUT MORE AT http://icl.cs.utk.edu/ft-la
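The storage side of this claim is easy to check numerically (a sketch; `abft_memory_overhead` is a hypothetical helper implementing the 2F-columns-every-q-columns rule stated above):

```python
from math import isclose

def abft_memory_overhead(q, F):
    """Relative storage overhead of ABFT protection: the matrix is
    extended with 2F checksum columns every q block columns."""
    return 2 * F / q

# Tolerating F = 2 simultaneous failures on growing process grids:
for q in (6, 12, 24, 48, 96, 192):
    print(f"q = {q:3d}: storage overhead = {abft_memory_overhead(q, 2):.2%}")

# On a 192x192 grid the storage overhead is about 2%; since F << q,
# the protection cost indeed shrinks as the machine grows.
assert isclose(abft_memory_overhead(192, 2), 4 / 192)
```

This is the sense in which the protection cost is inversely proportional to machine scale: F stays small while q grows.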

[Figure: weak-scaling performance on Kraken — Performance (TFlop/s) and Relative Overhead (%) vs. #Processors (P×Q grid) and matrix size N, from 6×6 (N=20k) up to 192×192 (N=640k); curves: ScaLAPACK PDGETRF, FT-PDGETRF (no error), FT-PDGETRF (w/1 recovery), and the overheads of the two FT runs.]

[Diagram: one step of ABFT LU on a 2D block-cyclic process grid — GETF2 on the panel, TRSM on U, GEMM on the trailing matrix A' and on the checksum columns C → C'.]

PERFORMANCE ON KRAKEN

MPI-Next ULFM Performance

Open MPI with ULFM; Kraken supercomputer


ABFT LU decomposition: implementation


Checkpoint on Failure - MPI Implementation

FT-MPI / MPI-Next FT: not easily available on large machines

Checkpoint on Failure = workaround


ABFT QR decomposition: performance

[Fig. 2: Performance (Tflop/s) vs. matrix size N (20k–100k) for ScaLAPACK QR, CoF-QR (w/o failure), and CoF-QR (w/1 failure) — ABFT QR and one CoF recovery on Kraken (Lustre).]

[Fig. 3: Performance (Gflop/s) vs. matrix size N (10k–50k), same three curves — ABFT QR and one CoF recovery on Dancer (local SSD).]

[Fig. 4: Application time share (%) vs. matrix size N (20k–50k) for Load Checkpoint, Dump Checkpoint, and ABFT Recovery — time breakdown of one CoF recovery on Dancer (local SSD).]

5.3 Checkpoint-on-Failure QR Performance

Supercomputer Performance: Figure 2 presents the performance on the Kraken supercomputer. The process grid is 24×24 and the block size is 100. The CoF-QR (no failure) curve presents the performance of the CoF QR implementation in a fault-free execution; it is noteworthy that, when there are no failures, the performance is exactly identical to the performance of the unmodified FT-QR implementation. The CoF-QR (with failure) curves present the performance when a failure is injected after the first step of the PDLARFB kernel. The performance of the non-fault-tolerant ScaLAPACK QR is also presented for reference.

Without failures, the performance overhead compared to the regular ScaLAPACK is caused by the extra computation to maintain the checksums inherent to the ABFT algorithm [12]; this extra computation is unchanged between CoF-QR and FT-QR. Only on runs where a failure happened do the CoF protocols undergo the supplementary overhead of storing and reloading checkpoints. However, the performance of CoF-QR remains very close to the no-failure case. For instance, at matrix size N=100,000, CoF-QR still achieves 2.86 Tflop/s after recovering from a failure, which is 90% of the performance of the non-fault-tolerant ScaLAPACK QR. This demonstrates that the CoF protocol enables efficient, practical recovery schemes on supercomputers.

Impact of Local Checkpoint Storage: Figure 3 presents the performance of the CoF-QR implementation on the Dancer cluster with an 8×16 process grid. Although a smaller test platform, the Dancer cluster features local storage on nodes and a variety of performance analysis tools unavailable on Kraken. As expected (see [12]), the ABFT method has a higher relative cost on this smaller machine. Compared to the Kraken platform, the relative cost of CoF failure recovery is smaller on Dancer. The CoF protocol incurs disk accesses to store and load checkpoints when a failure hits, hence the recovery overhead depends on I/O performance. By breaking down the relative cost of each recovery step in CoF, Figure 4 shows that checkpoint saving and loading take only a small percentage of the total run-time, thanks to the availability of solid-state disks on every node. Since checkpoint reloading immediately follows checkpointing, the OS cache satisfies most disk accesses, resulting in high I/O performance. For matrices larger than …

Checkpoint on Failure - MPI Performance

Open MPI; Kraken supercomputer;


Outline

1 Application-specific fault-tolerance techniques (45mn)
  Fault-Tolerant Middleware
  Bags of tasks
  Iterative algorithms and fixed-point convergence
  ABFT for Linear Algebra applications
  Composite approach: ABFT & Checkpointing


Fault Tolerance Techniques

General Techniques

Replication

Rollback Recovery

Coordinated Checkpointing
Uncoordinated Checkpointing & Message Logging
Hierarchical Checkpointing

Application-Specific Techniques

Algorithm-Based Fault Tolerance (ABFT)

Iterative Convergence

Approximated Computation


Application

Typical Application

for (an insane number) {
    /* Extract data from simulation, fill up matrix */
    sim2mat();

    /* Factorize matrix, solve */
    dgeqrf();
    dsolve();

    /* Update simulation with result vector */
    vec2sim();
}

[Timeline: Processes 0–2, each alternating between GENERAL phases (application code) and LIBRARY phases (library code).]

Characteristics

+ Large part of the (total) computation is spent in factorization / solve

Between LA operations:

− use the resulting vector / matrix with operations that do not preserve the checksums on the data

− modify data not covered by ABFT algorithms


Goodbye ABFT?!


Problem Statement

How to use fault-tolerant operations(*) within a non-fault-tolerant(**) application?(***)

(*) ABFT, or other application-specific FT
(**) Or within an application that does not have the same kind of FT

(***) And keep the application globally fault tolerant...


ABFT&PeriodicCkpt

ABFT&PeriodicCkpt: no failure

[Timeline: Processes 0–2 take periodic checkpoints during the GENERAL phase, and split forced checkpoints at the transitions into and out of the LIBRARY phase.]


ABFT&PeriodicCkpt: failure during Library phase

[Timeline: a failure during the LIBRARY phase triggers a partial rollback — only the GENERAL data is restored from the checkpoint — followed by ABFT recovery of the library data.]


ABFT&PeriodicCkpt: failure during General phase

[Timeline: a failure during the GENERAL phase triggers a full rollback to the last checkpoint, followed by recovery.]


ABFT&PeriodicCkpt: Optimizations


If the duration of the General phase is too small: don't add checkpoints

If the duration of the Library phase is too small: don't do ABFT recovery, remain in General mode

(this assumes a performance model for the library call)
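These two rules can be sketched as a small decision helper (hypothetical function and thresholds; the slides only state that a performance model of the library call is assumed):

```python
def choose_protection(t_general, t_library, ckpt_cost, abft_recovery_cost):
    """Decide, per phase, whether the extra FT machinery pays off.

    Hypothetical rule of thumb: skip a mechanism when the phase is
    shorter than the overhead that the mechanism would add.
    """
    plan = {}
    # GENERAL phase too short: the periodic checkpoints would cost more
    # than they can save -> don't add checkpoints.
    plan["checkpoint_general"] = t_general > ckpt_cost
    # LIBRARY phase too short: ABFT recovery would dominate -> stay in
    # General mode (plain checkpointing) for the library call too.
    plan["abft_library"] = t_library > abft_recovery_cost
    return plan

plan = choose_protection(t_general=100.0, t_library=2.0,
                         ckpt_cost=60.0, abft_recovery_cost=30.0)
assert plan == {"checkpoint_general": True, "abft_library": False}
```

A real implementation would derive the two thresholds from the performance model rather than fix them by hand.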


A few notations

[Timeline annotated with the epoch duration T_0, the General-phase time T_G, the Library-phase time T_L, and the checkpointing period P_G.]

Times, Periods

T_0: duration of an epoch (without FT)
T_L = α T_0: time spent in the Library phase
T_G = (1 − α) T_0: time spent in the General phase
P_G: periodic checkpointing period
T^ff, T_G^ff, T_L^ff: "fault-free" times
t_G^lost, t_L^lost: lost times (recovery overheads)
T_G^final, T_L^final: total times (with faults)


Costs

C_L = ρ C: time to take a checkpoint of the Library data set
C_G = (1 − ρ) C: time to take a checkpoint of the General data set
R, R_L: time to load a full / General data set checkpoint
D: down time (time to allocate a new machine / reboot)
Recons_ABFT: time to apply the ABFT recovery
φ: slowdown factor on the Library phase, when applying ABFT


General phase, fault-free waste


Without Failures

T_G^ff = T_G + C_L                      if T_G < P_G
T_G^ff = T_G / (P_G − C) × P_G          if T_G ≥ P_G


Library phase, fault-free waste


Without Failures

T_L^ff = φ × T_L + C_L


General phase, failure overhead


Failure Overhead

t_G^lost = D + R + T_G^ff / 2           if T_G < P_G
t_G^lost = D + R + P_G / 2              if T_G ≥ P_G


Library phase, failure overhead


Failure Overhead

t_L^lost = D + R_L + Recons_ABFT


Overall


The time (with overheads) of the Library phase is constant (in P_G):

T_L^final = 1 / (1 − (D + R_L + Recons_ABFT) / μ) × (φ × T_L + C_L)

The time (with overheads) of the General phase has two cases:

T_G^final = 1 / (1 − (D + R + (T_G + C_L)/2) / μ) × (T_G + C_L)      if T_G < P_G
T_G^final = T_G / ((1 − C/P_G)(1 − (D + R + P_G/2) / μ))             if T_G ≥ P_G

Which is minimal in the second case, if

P_G = √(2 C (μ − D − R))


Waste

From the previous, we derive the waste, which is obtained by

Waste = 1 − T_0 / (T_G^final + T_L^final)
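The model can be evaluated numerically. A sketch, taking T_G^final in the T_G ≥ P_G regime with P_G set to its optimal value (the `waste` helper and all parameter values are illustrative only):

```python
from math import sqrt

def waste(T0, alpha, mu, C, rho, R, RL, D, recons, phi):
    """Waste of ABFT&PeriodicCkpt over one epoch, following the slides."""
    TL, TG = alpha * T0, (1 - alpha) * T0
    CL = rho * C
    # Library phase: ABFT-protected, partial rollback on failure.
    T_final_L = (phi * TL + CL) / (1 - (D + RL + recons) / mu)
    # General phase: periodic checkpointing with the optimal period.
    PG = sqrt(2 * C * (mu - D - R))
    T_final_G = TG / ((1 - C / PG) * (1 - (D + R + PG / 2) / mu))
    return 1 - T0 / (T_final_G + T_final_L)

# Illustrative values (seconds): 1-day MTBF, 1-minute checkpoints,
# alpha = rho = 0.8, and a few percent ABFT slowdown (phi = 1.03).
w = waste(T0=100_000, alpha=0.8, mu=86_400, C=60, rho=0.8,
          R=60, RL=12, D=60, recons=30, phi=1.03)
assert 0 < w < 0.2
```

With these numbers the epoch is long enough that T_G ≥ P_G holds, so the second branch of T_G^final applies, and the resulting waste is a few percent.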


Toward Exascale, and Beyond!

Let’s think at scale

Number of components ↗ ⇒ MTBF ↘
Number of components ↗ ⇒ Problem Size ↗
Problem Size ↗ ⇒ computation time spent in the Library phase ↗

+ ABFT&PeriodicCkpt should perform better with scale

− By how much?


Competitors

FT algorithms compared

PeriodicCkpt: basic periodic checkpointing

Bi-PeriodicCkpt: applies incremental checkpointing techniques to save only the library data during the library phase

ABFT&PeriodicCkpt: the algorithm described above


Weak Scale #1

Weak Scale Scenario #1

The number of components, n, increases

Memory per component remains constant

Problem size increases in O(√n) (e.g., matrix operations)

μ is 1 day at n = 10^5, and scales in O(1/n)

C (= R) is 1 minute at n = 10^5, and scales in O(n)

α is constant at 0.8, as is ρ

(both the Library and the General phase grow in time at the same speed)
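The stated scalings can be written down directly (a sketch; `mtbf` and `ckpt_cost` are hypothetical helpers interpolating the baseline values given above):

```python
def mtbf(n):
    """Platform MTBF in seconds: 1 day at n = 1e5, scaling as O(1/n)."""
    return 86_400 * (1e5 / n)

def ckpt_cost(n):
    """Checkpoint (= recovery) cost in seconds: 1 minute at n = 1e5,
    scaling as O(n)."""
    return 60 * (n / 1e5)

# As the machine grows, the MTBF shrinks while checkpoints get more
# expensive -- the squeeze that motivates ABFT&PeriodicCkpt.
for n in (1e3, 1e4, 1e5, 1e6):
    print(f"n = {n:>9,.0f}: mu = {mtbf(n):>9.0f} s, C = {ckpt_cost(n):6.1f} s")

assert mtbf(1e5) == 86_400 and ckpt_cost(1e5) == 60
assert mtbf(1e6) < mtbf(1e5) and ckpt_cost(1e6) > ckpt_cost(1e5)
```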


Weak Scale #1

[Figure: Waste vs. number of nodes (1k to 1M), with the expected number of faults per run (0–40) shown above; curves: PeriodicCkpt, Bi-PeriodicCkpt, ABFT&PeriodicCkpt.]


Weak Scale #2

Weak Scale Scenario #2

The number of components, n, increases

Memory per component remains constant

Problem size increases in O(√n) (e.g., matrix operations)

μ is 1 day at n = 10^5, and scales in O(1/n)

C (= R) is 1 minute at n = 10^5, and scales in O(n)

ρ remains constant at 0.8, but the Library phase grows in O(n^3) while the General phase grows in O(n^2) (α is 0.8 at n = 10^5 nodes)


Weak Scale #2

[Figure: Waste and ratio of time spent in the ABFT routine vs. number of nodes (1k to 1M), with the expected number of faults per run; curves: PeriodicCkpt, Bi-PeriodicCkpt, ABFT&PeriodicCkpt, ABFT Ratio.]


Weak Scale #3

Weak Scale Scenario #3

The number of components, n, increases

Memory per component remains constant

Problem size increases in O(√n) (e.g., matrix operations)

μ is 1 day at n = 10^5, and scales in O(1/n)

C (= R) is 1 minute at n = 10^5, and stays independent of n (O(1))

ρ remains constant at 0.8, but the Library phase grows in O(n^3) while the General phase grows in O(n^2) (α is 0.8 at n = 10^5 nodes)


Weak Scale #3

[Figure: Waste vs. number of nodes — 1k (α = 0.55), 10k (α = 0.8), 100k (α = 0.92), 1M (α = 0.975) — with the expected number of faults per run (0–6); curves: PeriodicCkpt, Bi-PeriodicCkpt, ABFT&PeriodicCkpt.]