
Fault tolerance techniques for high-performance computing

Part 1

Anne Benoit

ENS Lyon

[email protected]

http://graal.ens-lyon.fr/~abenoit

CR02 - 2016/2017


Outline

1 Faults and failures

2 Checkpoint and rollback recovery
    Process checkpointing
    Coordinated checkpointing
    Hierarchical checkpointing

3 Probabilistic models
    Young/Daly’s approximation


Exascale platforms (courtesy Jack Dongarra)

Potential System Architecture with a cap of $200M and 20 MW

                              Systems 2011 (K computer)   2019                   Difference Today & 2019
System peak                   10.5 Pflop/s                1 Eflop/s              O(100)
Power                         12.7 MW                     ~20 MW
System memory                 1.6 PB                      32 - 64 PB             O(10)
Node performance              128 GF                      1, 2 or 15 TF          O(10) – O(100)
Node memory BW                64 GB/s                     2 - 4 TB/s             O(100)
Node concurrency              8                           O(1k) or 10k           O(100) – O(1000)
Total node interconnect BW    20 GB/s                     200-400 GB/s           O(10)
System size (nodes)           88,124                      O(100,000) or O(1M)    O(10) – O(100)
Total concurrency             705,024                     O(billion)             O(1,000)
MTTI                          days                        O(1 day)               - O(10)


Exascale platforms (courtesy C. Engelmann & S. Scott)


Exascale platforms

Hierarchical
• 10^5 or 10^6 nodes
• Each node equipped with 10^4 or 10^3 cores

Failure-prone

MTBF – one node                  1 year    10 years    120 years
MTBF – platform of 10^6 nodes    30 sec    5 min       1 h

More nodes ⇒ Shorter MTBF (Mean Time Between Failures)

Exascale ≠ Petascale × 1000


Even for today’s platforms (courtesy F. Cappello)


Even for today’s platforms (courtesy F. Cappello)


Outline

1 Faults and failures

2 Checkpoint and rollback recovery

3 Probabilistic models


Error sources (courtesy Franck Cappello)

•  Analysis of error and failure logs

•  In 2005 (Ph.D. of Charng-Da Lu): “Software halts account for the most number of outages (59-84 percent), and take the shortest time to repair (0.6-1.5 hours). Hardware problems, albeit rarer, need 6.3-100.7 hours to solve.”

•  In 2007 (Garth Gibson, ICPP Keynote):

•  In 2008 (Oliner and J. Stearley, DSN Conf.): 50% hardware

Conclusion: both hardware and software failures have to be considered

Software errors: applications, OS bugs (kernel panic), communication libraries, file system errors and others

Hardware errors: disks, processors, memory, network


A few definitions

Many types of faults: software error, hardware malfunction, memory corruption

Many possible behaviors: silent, transient, unrecoverable

Restrict to faults that lead to application failures

This includes all hardware faults, and some software ones

Will use terms fault and failure interchangeably

Silent errors (SDC) will be addressed later in the course

First question: quantify the rate or frequency at which these faults strike!


Failure distributions: (1) Exponential

[Figure: failure probability as a function of time (years) for a sequential machine, Exp(1/100)]

Exp(λ): Exponential distribution law of parameter λ:

Probability density function (pdf): f(t) = λ e^{-λt} for t ≥ 0

Cumulative distribution function (cdf): F(t) = 1 − e^{-λt}

Mean: µ = 1/λ


Failure distributions: (1) Exponential

[Figure: failure probability as a function of time (years) for a sequential machine, Exp(1/100)]

X random variable for Exp(λ) failure inter-arrival times:

P(X ≤ t) = 1 − e^{-λt} (by definition)

Memoryless property: P(X ≥ t + s | X ≥ s) = P(X ≥ t) for all t, s ≥ 0: at any instant, the time to the next failure does not depend upon the time elapsed since the last failure

Mean Time Between Failures (MTBF): µ = E(X) = 1/λ


Failure distributions: (2) Weibull

[Figure: failure probability as a function of time (years) for a sequential machine, Exp(1/100), Weibull(0.7, 1/100) and Weibull(0.5, 1/100)]

Weibull(k, λ): Weibull distribution law of shape parameter k and scale parameter λ:

Pdf: f(t) = kλ (λt)^{k−1} e^{-(λt)^k} for t ≥ 0

Cdf: F(t) = 1 − e^{-(λt)^k}

Mean: µ = (1/λ) Γ(1 + 1/k)


Failure distributions: (2) Weibull

[Figure: failure probability as a function of time (years) for a sequential machine, Exp(1/100), Weibull(0.7, 1/100) and Weibull(0.5, 1/100)]

X random variable for Weibull(k, λ) failure inter-arrival times:

If k < 1: failure rate decreases with time (“infant mortality”: defective items fail early)

If k = 1: Weibull(1, λ) = Exp(λ), constant failure rate (see the sampling sketch below)
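To make both laws concrete, here is a minimal sketch (not from the slides; the parameter values only mirror the plotted curves) that draws failure inter-arrival times by inverse-transform sampling of the cdfs given above, and checks the empirical means against 1/λ and Γ(1 + 1/k)/λ:

/* Sampling failure inter-arrival times by inverse transform.
 * Exp(lambda):        X = -ln(U) / lambda
 * Weibull(k, lambda): X = (-ln(U))^(1/k) / lambda   (cdf 1 - exp(-(lambda t)^k))
 * Compile with: cc -O2 sample.c -lm
 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static double uniform01(void) {
    return (rand() + 1.0) / ((double)RAND_MAX + 2.0);   /* in (0,1) */
}

static double sample_exp(double lambda) {
    return -log(uniform01()) / lambda;
}

static double sample_weibull(double k, double lambda) {
    return pow(-log(uniform01()), 1.0 / k) / lambda;
}

int main(void) {
    const double lambda = 1.0 / 100.0;   /* MTBF of 100 years for Exp(1/100) */
    const double k = 0.7;
    const int n = 1000000;
    double sum_e = 0.0, sum_w = 0.0;
    for (int i = 0; i < n; i++) {
        sum_e += sample_exp(lambda);
        sum_w += sample_weibull(k, lambda);
    }
    /* empirical means should approach 1/lambda and Gamma(1 + 1/k)/lambda */
    printf("Exp mean     ~ %.1f  (expected %.1f)\n", sum_e / n, 1.0 / lambda);
    printf("Weibull mean ~ %.1f  (expected %.1f)\n", sum_w / n,
           tgamma(1.0 + 1.0 / k) / lambda);
    return 0;
}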


Failure distributions: (3) with several processors

Processor (or node): any entity subject to failures
⇒ approach agnostic to granularity

If the MTBF is µ with one processor, what is its value with p processors?

Well, it depends :(


With rejuvenation

Rebooting all p processors after a failure

Platform failure distribution
⇒ minimum of p IID processor distributions

With p distributions Exp(λ):

min(Exp(λ1), Exp(λ2)) = Exp(λ1 + λ2)

µ = 1/λ ⇒ µ_p = µ/p

With p distributions Weibull(k, λ):

min_{1..p}(Weibull(k, λ)) = Weibull(k, p^{1/k} λ)

µ = (1/λ) Γ(1 + 1/k) ⇒ µ_p = µ/p^{1/k}
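A quick justification of these two formulas (a standard computation, not spelled out on the slide): for independent processors the survival functions multiply, so

P(\min(X_1, X_2) > t) = e^{-\lambda_1 t}\, e^{-\lambda_2 t} = e^{-(\lambda_1 + \lambda_2) t},
\qquad
P\Bigl(\min_{1 \le q \le p} X_q > t\Bigr) = \bigl(e^{-(\lambda t)^k}\bigr)^{p} = e^{-(p^{1/k} \lambda t)^k},

i.e. the minimum of p i.i.d. Weibull(k, λ) variables is Weibull(k, p^{1/k} λ), whose mean is µ/p^{1/k}.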


Without rejuvenation (= real life)

Rebooting only faulty processor

Platform failure distribution
⇒ superposition of p IID processor distributions
⇒ IID only for Exponential

Define µ_p by

lim_{F→+∞} n(F)/F = 1/µ_p

n(F) = number of platform failures until time F is exceeded

Theorem: µ_p = µ/p for arbitrary distributions


Intuition

[Figure: failure times of three processors p1, p2, p3 over a time interval t, and of the aggregated platform p]

If three processors have around 20 faults during a time t (µ = t/20)...

...during the same time, the platform has around 60 faults (µ_p = t/60)


MTBF with p processors (1/2)

Theorem: µ_p = µ/p for arbitrary distributions

With one processor:

n(F) = number of failures until time F is exceeded

X_i iid random variables for inter-arrival times, with E(X_i) = µ

∑_{i=1}^{n(F)−1} X_i ≤ F ≤ ∑_{i=1}^{n(F)} X_i

Wald’s equation: (E(n(F)) − 1) µ ≤ F ≤ E(n(F)) µ

lim_{F→+∞} E(n(F))/F = 1/µ


MTBF with p processors (2/2)

Theorem: µ_p = µ/p for arbitrary distributions

With p processors:

n(F) = number of platform failures until time F is exceeded

n_q(F) = number of those failures that strike processor q

n_q(F) + 1 = number of failures on processor q until time F is exceeded (except for the processor struck by the last failure)

lim_{F→+∞} n_q(F)/F = 1/µ as above

lim_{F→+∞} n(F)/F = 1/µ_p by definition

Hence µ_p = µ/p because n(F) = ∑_{q=1}^{p} n_q(F) (see the simulation sketch below)
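As an empirical check of the theorem (a sketch with illustrative parameters, not part of the slides), one can superpose p independent Weibull renewal processes, count the platform failures n(F) up to a horizon F, and compare F/n(F) with µ/p:

/* Empirical check of mu_p = mu/p for a non-memoryless law (Weibull).
 * Each processor is an independent renewal process; only the faulty
 * processor restarts its clock (no rejuvenation of the others).
 * Compile with: cc -O2 mtbf.c -lm
 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static double uniform01(void) {
    return (rand() + 1.0) / ((double)RAND_MAX + 2.0);
}

/* Weibull(k, lambda) inter-arrival time, cdf 1 - exp(-(lambda t)^k) */
static double sample_weibull(double k, double lambda) {
    return pow(-log(uniform01()), 1.0 / k) / lambda;
}

int main(void) {
    const int p = 100;                  /* number of processors */
    const double k = 0.7, lambda = 0.01;
    const double mu = tgamma(1.0 + 1.0 / k) / lambda;   /* per-processor MTBF */
    const double horizon = 1.0e6;       /* observation window F */

    long failures = 0;
    for (int q = 0; q < p; q++) {       /* superpose p renewal processes */
        double t = sample_weibull(k, lambda);
        while (t <= horizon) {
            failures++;
            t += sample_weibull(k, lambda);
        }
    }
    printf("empirical mu_p = %.3f, mu/p = %.3f\n",
           horizon / (double)failures, mu / p);
    return 0;
}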


A little digression for aficionados

Xi IID random variables for processor inter-arrival times

Assume Xi continuous, with E (Xi ) = µ

Yi random variables for platform inter-arrival times

Definition: µ_p := lim_{n→+∞} (∑_{i=1}^{n} E(Y_i)) / n

The limit always exists (superposition of renewal processes)

Theorem: µ_p = µ/p


Values from the literature

MTBF of one processor: between 1 and 125 years

Shape parameters for Weibull: k = 0.5 or k = 0.7

Failure trace archive from INRIA (http://fta.inria.fr)

Computer Failure Data Repository from LANL (http://institutes.lanl.gov/data/fdata)


Does it matter?

[Figure: failure probability over 24 hours for a parallel machine with 10^6 nodes, Exp(1/100), Weibull(0.7, 1/100) and Weibull(0.5, 1/100)]

After infant mortality and before aging, the instantaneous failure rate of computer platforms is almost constant


Summary for the road

MTBF is the key parameter, and µ_p = µ/p :)

Exponential distribution OK for most purposes :)

Assume failure independence, while not (completely) true :(


Outline

1 Faults and failures

2 Checkpoint and rollback recovery
    Process checkpointing
    Coordinated checkpointing
    Hierarchical checkpointing

3 Probabilistic models


Outline

1 Faults and failures

2 Checkpoint and rollback recovery
    Process checkpointing
    Coordinated checkpointing
    Hierarchical checkpointing

3 Probabilistic models


Maintaining redundant information

Goal

General Purpose Fault Tolerance Techniques: work despite the application behavior

Two adversaries: Failures & Application

Use automatically computed redundant information

At given instants: checkpoints
At any instant: replication
Or anything in between: checkpoint + message logging


Process checkpointing

Goal

Save the current state of the process

FT Protocols save a possible state of the parallel application

Techniques

User-level checkpointing

System-level checkpointing

Blocking call

Asynchronous call


User-level checkpointing

User code serializes the state of the process in a file.

Usually small(er than system-level checkpointing)

Portability

Diversity of use

Hard to implement if preemptive checkpointing is needed

Loss of the function call stack

code full of jumps
loss of internal library state
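As an illustration, a user-level checkpoint can be as simple as the following sketch (the state_t layout and file names are illustrative, not an existing library): the application decides what constitutes its state, serializes it periodically, and reloads it on restart.

/* Minimal user-level checkpoint: the application itself serializes
 * the variables that define its state. */
#include <stdio.h>
#include <stdlib.h>

#define CKPT_FILE "app.ckpt"

typedef struct {
    long   iteration;     /* where to resume the main loop    */
    size_t n;             /* problem size                     */
    double data[1024];    /* the application data to preserve */
} state_t;

static void checkpoint(const state_t *s) {
    FILE *f = fopen(CKPT_FILE ".tmp", "wb");
    if (!f) { perror("checkpoint"); return; }
    fwrite(s, sizeof(*s), 1, f);
    fclose(f);
    rename(CKPT_FILE ".tmp", CKPT_FILE);   /* atomic commit of the checkpoint */
}

static int restart(state_t *s) {
    FILE *f = fopen(CKPT_FILE, "rb");
    if (!f) return 0;                      /* no checkpoint: cold start */
    int ok = fread(s, sizeof(*s), 1, f) == 1;
    fclose(f);
    return ok;
}

int main(void) {
    state_t s = { 0, 1024, {0} };
    if (!restart(&s))
        printf("starting from scratch\n");
    for (; s.iteration < 1000000; s.iteration++) {
        s.data[s.iteration % s.n] += 1.0;              /* "compute" */
        if (s.iteration % 10000 == 0) checkpoint(&s);  /* periodic checkpoint */
    }
    return 0;
}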


System-level checkpointing

Different possible implementations: OS syscall; dynamic library; compiler assisted

Create a serial file that can be loaded in a process image. Usually on the same architecture, same OS, same software environment.

Entirely transparent

Preemptive (often needed for library-level checkpointing)

Lack of portability

Large size of checkpoint (≈ memory footprint)


Blocking / Asynchronous call

Blocking checkpointing

Relatively intuitive: checkpoint(filename)

Cost: no process activity during the whole checkpoint operation. Can be linear in the size of memory and in the size of modified files

Asynchronous checkpointing

System-level approach: make use of the copy-on-write semantics of the fork syscall
User-level approach: critical sections, when needed
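A sketch of the fork-based, copy-on-write variant mentioned above (POSIX calls only; illustrative): the forked child sees a frozen snapshot of the parent's memory and writes it to disk while the parent keeps computing.

/* Asynchronous (forked) checkpointing sketch. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void checkpoint_async(const void *state, size_t size, const char *path) {
    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return; }
    if (pid == 0) {                        /* child: sees a copy-on-write snapshot */
        FILE *f = fopen(path, "wb");
        if (f) { fwrite(state, 1, size, f); fclose(f); }
        _exit(0);
    }
    /* parent: returns immediately and keeps computing; the child is reaped later */
}

int main(void) {
    static double state[1 << 20];          /* application state (8 MB) */
    for (int step = 0; step < 100; step++) {
        state[step] = step;                /* compute, modifying the state */
        if (step % 25 == 0)
            checkpoint_async(state, sizeof(state), "snapshot.ckpt");
    }
    while (waitpid(-1, NULL, 0) > 0) ;     /* wait for outstanding checkpoints */
    return 0;
}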


Storage

Remote reliable storage

Intuitive. I/O intensive. Disk usage.

Memory hierarchy

local memory

local disk (SSD, HDD)

remote disk

Scalable Checkpoint/Restart Library: http://scalablecr.sourceforge.net

Checkpoint is valid when finished on reliable storage

Distributed memory storage

In-memory checkpointing

Disk-less checkpointing


Outline

1 Faults and failures

2 Checkpoint and rollback recovery
    Process checkpointing
    Coordinated checkpointing
    Hierarchical checkpointing

3 Probabilistic models


Coordinated checkpointing

[Figure: a checkpoint wave crossed by orphan messages and a missing message]

Definition (Missing Message)

A message is missing if, in the current configuration, the sender sent it, while the receiver did not receive it


Coordinated checkpointing

[Figure: a checkpoint wave crossed by orphan messages and a missing message]

Definition (Orphan Message)

A message is orphan if, in the current configuration, the receiver received it, while the sender did not send it


Coordinated checkpointing

Create a consistent view of the application (no orphan messages)

Every message belongs to a single checkpoint wave

All communication channels must be flushed (all2all)


Blocking coordinated checkpointing

[Figure: blocking coordinated checkpointing, with application messages and marker messages]

Silences the network during checkpoint


Non-Blocking Coordinated Checkpointing

[Figure: non-blocking coordinated checkpointing, with application messages and marker messages]

Communications received after the beginning of the checkpoint and before its end are added to the receiver’s checkpoint

Communications inside a checkpoint are pushed back at the beginning of the queues


Implementation

Communication Library

Flush of communication channels

Conservative approach: one message per open channel / one message per channel

Preemptive checkpointing usually required

Can have a user-level checkpointing, but requires one that can be called at any time

Application Level

Flush of communication channels

Can be as simple as Barrier(); Checkpoint();

Or as complex as having a quiesce(); function in all libraries

User-level checkpointing
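In the spirit of the "Barrier(); Checkpoint();" idiom, an application-level checkpoint wave could look like the sketch below (illustrative; it assumes the application has already completed all outstanding communications, so the channels are empty when the barrier is crossed).

/* Application-level coordinated checkpoint sketch.  Every process reaches
 * a point with no in-flight application messages, synchronizes, and saves
 * its local state; the wave is complete once everyone has finished. */
#include <mpi.h>
#include <stdio.h>

static void checkpoint_wave(const double *state, int n, int wave) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);   /* channels flushed: everyone checkpoints the same wave */

    char name[64];
    snprintf(name, sizeof(name), "ckpt-%d-wave%d.bin", rank, wave);
    FILE *f = fopen(name, "wb");
    if (f) { fwrite(state, sizeof(double), (size_t)n, f); fclose(f); }

    MPI_Barrier(MPI_COMM_WORLD);   /* the wave is valid only once all processes are done */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    double state[1024] = {0};
    for (int wave = 0; wave < 3; wave++) {
        /* ... compute and complete all pending communications ... */
        checkpoint_wave(state, 1024, wave);
    }
    MPI_Finalize();
    return 0;
}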


Coordinated Protocol Performance

[Figure (Fig. 7 of the source paper): impact of checkpoint frequency on BT.B.64 and CG.C.64 over a Myricom network: number of checkpoint waves and execution time as functions of the time between checkpoints, for VCL and PCL. The accompanying text observes that PCL's completion time grows linearly with the number of checkpoint waves, because of the synchronizations introduced by the blocking protocol, while the number of waves does not directly influence VCL; that PCL performs much better than VCL on the latency-bound CG benchmark (VCL's communication daemon adds extra copies and latency); that VCL performs better on the computation-bound BT benchmark at high checkpoint frequency; and that the large-scale Grid5000 experiments report results for PCL only, VCL not being designed for that scale.]

VCL = nonblocking coordinated protocol

PCL = blocking coordinated protocol


Outline

1 Faults and failures

2 Checkpoint and rollback recovery
    Process checkpointing
    Coordinated checkpointing
    Hierarchical checkpointing

3 Probabilistic models


Uncoordinated Checkpointing Idea

Processes checkpoint independently


Uncoordinated Checkpointing Idea

Optimistic Protocol

Each process i keeps some checkpoints C_i^j

∀(i_1, . . . , i_n), ∃(j_k) such that {C_{i_k}^{j_k}} form a consistent cut?

Domino Effect


Piece-wise Deterministic Assumption

[Figure: an execution as an alternating sequence of non-deterministic choices and deterministic sequences]

Piece-wise Deterministic Assumption

Process: alternating sequence of non-deterministic choices and deterministic steps

Translated in Message Passing:

Receptions / progress tests are non-deterministic
  (MPI_Wait(ANY_SOURCE), if (MPI_Test()) <...>; else <...>)
Emissions / others are deterministic
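Under this assumption, only the outcome of the non-deterministic calls has to be recorded. Below is a sketch (illustrative, not a real library interface) of a wrapped ANY_SOURCE reception that logs which sender was actually matched, so the choice can be replayed identically after a failure.

/* Fragment to be linked into an MPI program: logging the outcome of a
 * non-deterministic reception. */
#include <mpi.h>
#include <stdio.h>

/* append one event (receiver, reception index, matched source, tag) */
static void log_event(int rank, long index, int source, int tag) {
    fprintf(stderr, "event: recv #%ld on %d from %d tag %d\n",
            index, rank, source, tag);   /* stand-in for a reliable event log */
}

static long recv_count = 0;

static int logged_recv(void *buf, int count, MPI_Datatype type, MPI_Comm comm) {
    MPI_Status st;
    int rank;
    MPI_Comm_rank(comm, &rank);
    int rc = MPI_Recv(buf, count, type, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &st);
    /* the matched source/tag is the non-deterministic outcome to log */
    log_event(rank, recv_count++, st.MPI_SOURCE, st.MPI_TAG);
    return rc;
}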


Message Logging

[Figure: a message with payload P sent from B to A; nth emission on B, mth reception on A, unique identifier (B, n, A, m); the ith probe of the channel returns false, the (i+1)th returns true]

Message Logging

By replaying the sequence of messages and tests/probes with the result obtained during the initial execution (from the last checkpoint), one can guide the execution of a process to its exact state just before the failure


Message Logging

[Figure: same message-identifier example as above]

Message / Events

Message = unique identifier (source, emission index, destination, reception index) + payload (content of the message)

Probe = unique identifier (number of consecutive failed/successful probes on this link)

Event Logging: saving the unique identifier of a message, or of a probe


Message Logging

[Figure: same message-identifier example as above]

Message / Events

Payload Logging: saving the content of a message

Message Logging: saving the unique identifier and the payload of a message, saving unique identifiers of probes, saving the (local) order of events
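A minimal sketch of the payload-logging side (the logged_msg structure and the logged_send wrapper are hypothetical): the emitter keeps a local copy of the payload, keyed by the message identifier, so the message can be resent if the receiver rolls back.

/* Sender-based payload logging sketch (fragment for an MPI program). */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

typedef struct logged_msg {
    int    dest;
    long   emission_index;   /* emission counter on this sender   */
    int    count;
    char  *payload;          /* sender-based copy of the content  */
    struct logged_msg *next;
} logged_msg;

static logged_msg *payload_log = NULL;
static long emission_index = 0;

static int logged_send(const void *buf, int count, MPI_Datatype type,
                       int dest, int tag, MPI_Comm comm) {
    int size;
    MPI_Type_size(type, &size);

    logged_msg *m = malloc(sizeof(*m));
    m->dest = dest;
    m->emission_index = emission_index++;
    m->count = count;
    m->payload = malloc((size_t)count * size);
    memcpy(m->payload, buf, (size_t)count * size);   /* local copy: fast, but memory-hungry */
    m->next = payload_log;
    payload_log = m;

    return MPI_Send(buf, count, type, dest, tag, comm);
}
/* Entries older than the receivers' last checkpoints can be garbage collected;
 * surviving entries must be included in the sender's own checkpoint. */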


Message Logging

[Figure: messages P and Q between processes A and B, with their checkpoints; P will never be requested again, while Q might be requested if A and B roll back]

Where to save the Payload?

Almost always sender-based

Local copy: less impact on performance

More memory demanding → trade-off, garbage collection algorithm

Payload needs to be included in the checkpoints


Message Logging

Where to save the Events?

Events must be saved on a reliable space

Must avoid: loss of event-ordering information, for all events that can impact the outgoing communications

Two (three) approaches: pessimistic + reliable system, or causal (or optimistic)


Optimistic Message Logging

[Figure: processes A, B, C and a reliable event log; messages P, Q, R, S in flight]

Where to save the Events?

On a reliable media, asynchronously

“Hope that the event will have time to be logged” (before its loss is damageable)


Optimistic Message Logging

[Figure: message P from A to B; B sends the event identifier (A, 1, B, 1) asynchronously to the event log]

Where to save the Events?

On a reliable media, asynchronously

“Hope that the event will have time to be logged” (before its loss is damageable)


Optimistic Message Logging

[Figure: the event log acknowledges the event (A, 1, B, 1)]

Where to save the Events?

On a reliable media, asynchronously

“Hope that the event will have time to be logged” (before its loss is damageable)


Optimistic Message Logging

[Figure: a second message Q from C to B; its event identifier (C, 3, B, 5) is sent to the event log but not yet acknowledged]

Where to save the Events?

On a reliable media, asynchronously

“Hope that the event will have time to be logged” (before its loss is damageable)


Optimistic Message Logging

[Figure: danger zone — execution continues while the event (C, 3, B, 5) has not yet been acknowledged by the event log]

Where to save the Events?

On a reliable media, asynchronously

“Hope that the event will have time to be logged” (before its loss is damageable)


Pessimistic Message Logging

[Figure: processes A, B, C and a reliable event log; messages P, Q, R, S in flight]

Where to save the Events?

On a reliable media, synchronously

Delay of emissions that depend on non-deterministic choices until the corresponding choice is acknowledged

Recovery: connect to the storage system to get the history
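The pessimistic discipline can be sketched as follows (the event-log calls are hypothetical stand-ins for the reliable storage system): an emission that causally depends on a non-deterministic event is delayed until that event has been acknowledged.

/* Pessimistic event logging sketch (fragment for an MPI program). */
#include <mpi.h>

static long pending_event = -1;   /* last local event not yet acknowledged */

/* stubs standing in for a reliable event-log service (hypothetical API) */
static void event_log_publish(long id)  { (void)id; /* start the remote write */ }
static void event_log_wait_ack(long id) { (void)id; /* block until acknowledged */ }

static void on_nondeterministic_event(long id) {
    event_log_publish(id);        /* log the outcome of the choice */
    pending_event = id;           /* subsequent emissions must wait for it */
}

static int pessimistic_send(const void *buf, int count, MPI_Datatype type,
                            int dest, int tag, MPI_Comm comm) {
    if (pending_event >= 0) {     /* delay the emission until the event is stable */
        event_log_wait_ack(pending_event);
        pending_event = -1;
    }
    return MPI_Send(buf, count, type, dest, tag, comm);
}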


Pessimistic Message Logging

[Figure: B delays an emission until the event (C, 3, B, 5) is acknowledged by the event log]

Where to save the Events?

On a reliable media, synchronously

Delay of emissions that depend on non-deterministic choices until the corresponding choice is acknowledged

Recovery: connect to the storage system to get the history


Causal Message Logging

[Figure: causal logging — the outgoing message itself piggybacks the not-yet-stable event (C, 3, B, 5)]

Where to save the Events?

Any message carries with it (piggybacked) the whole history of non-deterministic events that precede it

Garbage collection using checkpointing, detection of cycles

Can be coupled with asynchronous storage on reliable media to help garbage collection

Recovery: global communication + potential storage

Recover in Message Logging

[Figure: recovery of process B from its checkpoint (CKPT); the event log holds (A, 1, B, 1) and (C, 3, B, 5); A resends P, C resends Q, and B replays them in history order]

Recovery

Collect the history (from the event log / event log + peers for Causal)

Collect Id of last message sent

Emitters resend, deliver in history order

Fake emission of sent messages


Uncoordinated Protocol Performance

[Figures from the source paper: Fig. 5, scalability comparison of pessimistic and optimistic message logging protocols on the NAS benchmarks (BT, CG, FT, LU, MG, SP, Class C) on Gigabit Ethernet, normalized overhead for 16 to 256 processes; and Fig. 6, normalized performance of the NAS kernels for 64 processes on the Myrinet 10G network, with and without the sender-based payload copy. The accompanying text notes that only MG and LU generate non-deterministic events, so both protocols exhibit the same scalability, with the sender-based payload copy keeping the overhead under 10%; that for LU the optimistic protocol is about 6% more efficient than the pessimistic one at 256 processes; and that on the faster Myrinet network event logging itself differs by less than 2%, the sender-based copy dominating the overhead.]

NAS Parallel Benchmarks – 64 nodes

High Performance Linpack

Figures courtesy of A. Bouteiller, G. Bosilca


Hierarchical Protocols

Many Core Systems

All interactions between threads are considered as messages

Explosion of the number of events

Cost of message payload logging ≈ cost of communicating → sender-based logging expensive

Correlation of failures on the node


Hierarchical Protocols

Hierarchical Protocol

Processes are separated into groups

A group coordinates its checkpoint

Between groups, use message logging


Hierarchical Protocols

Hierarchical Protocol

Coordinated Checkpointing: the processes can behave as a non-deterministic entity (interactions between processes)

Need to log the non-deterministic events: Hierarchical Protocols are uncoordinated protocols + event logging

No need to log the payload


Event Log Reduction

Strategies to reduce the amount of event log

Few HPC applications use message ordering / timing information to take decisions

Many receptions (in MPI) are in fact deterministic: they do not need to be logged

For others, although the reception is non-deterministic, the order does not influence the interactions of the process with the rest (send-determinism). No need to log either

Reduction of the amount of log to a few applications, for a few messages: event logging can be overlapped


Hierarchical Protocol Performance

[Figures from the source paper: Fig. 5, NAS performance on the Pluto platform (shared memory, 32/36 cores), regular and coordinated message logging normalized to vanilla Open MPI; and Fig. 6, HPL cluster performance (Dancer cluster, IB20G, 8 nodes, 64 cores), GFlop/s versus matrix size for the theoretical peak, vanilla Open MPI, coordinated message logging and regular message logging. The surrounding text explains that sparing the payload logging for messages exchanged within a correlated group keeps uncoordinated recovery scalability while significantly reducing the shared-memory overhead and the memory needed to hold message payloads, and contrasts the approach with communication-induced checkpointing, group-coordinated checkpointing in MVAPICH2, and team-based message logging in Charm++/AMPI.]

NAS Parallel Benchmarks – shared memory system, 32 cores

HPL distributed system, 64 cores, 8 groups


Outline

1 Faults and failures

2 Checkpoint and rollback recovery

3 Probabilistic models
    Young/Daly’s approximation


Outline

1 Faults and failures

2 Checkpoint and rollback recovery

3 Probabilistic models
    Young/Daly’s approximation


Checkpointing cost

[Figure: timeline — computing the first chunk, checkpointing the first chunk, then processing the second chunk; time spent working vs. time spent checkpointing]

Blocking model: while a checkpoint is taken, no computation can be performed


Framework

Periodic checkpointing policy of period T

Independent and identically distributed (IID) failures

Applies to a single processor with MTBF µ = µ_ind
Applies to a platform with p processors with MTBF µ = µ_ind/p

coordinated checkpointing
tightly-coupled application
progress ⇔ all processors available

⇒ platform = single (powerful, unreliable) processor :)

Waste: fraction of time not spent for useful computations


Waste in fault-free execution

[Figure: timeline alternating work and checkpoints, failure-free execution]

Timebase: application base time

TimeFF: with periodic checkpoints but failure-free

TimeFF = Timebase + #checkpoints × C

#checkpoints = ⌈Timebase / (T − C)⌉ ≈ Timebase / (T − C)   (valid for large jobs)

Waste[FF] = (TimeFF − Timebase) / TimeFF = C/T


Waste due to failures

Timebase: application base time

TimeFF: with periodic checkpoints but failure-free

Timefinal: expectation of time with failures

Timefinal = TimeFF + Nfaults × Tlost

Nfaults: number of failures during execution
Tlost: average time lost per failure

Nfaults = Timefinal / µ

Tlost?


Computing Tlost

[Figure: processors P0–P3; a failure strikes during a period of length T (work T − C, then checkpoint C), followed by a downtime D and a recovery R]

Tlost = D + R + T/2

Rationale:
⇒ Instants when periods begin and failures strike are independent
⇒ Approximation used for all distribution laws
⇒ Exact for Exponential and uniform distributions


Waste due to failures

Timefinal = TimeFF + Nfaults × Tlost

Waste[fail] = (Timefinal − TimeFF) / Timefinal = (1/µ) (D + R + T/2)


Total waste

[Figure: Timefinal split into TimeFF = Timefinal × (1 − Waste[fail]) and Timefinal × Waste[fail]; the failure-free part is a sequence of periods of length T, each with T − C of work followed by a checkpoint C]

Waste = (Timefinal − Timebase) / Timefinal

1 − Waste = (1 − Waste[FF]) (1 − Waste[fail])

Waste = C/T + (1 − C/T) (1/µ) (D + R + T/2)

How do we minimize the waste? (use the goat’s lemma!)
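The slides stop at this question; as a preview of the Young/Daly approximation announced in the outline (a standard first-order computation, not shown on this slide), setting the derivative of the waste to zero gives the optimal period:

\frac{d\,\mathrm{Waste}}{dT} = -\frac{C}{T^{2}} + \frac{C(D+R)}{\mu T^{2}} + \frac{1}{2\mu} = 0
\;\Longrightarrow\;
T_{\mathrm{opt}} = \sqrt{2C(\mu - D - R)} \approx \sqrt{2\mu C},

for which the waste is of order \sqrt{2C/\mu}.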
