
Priority Research Direction (use one slide for each)

Key challenges

-Fault understanding (RAS), modeling, prediction

-Fault isolation/confinement + local management

-Resilience <--> new tech (Flash mem, virtualization)

-Resilience <--> power consumption | heterogeneity

-Resilient Storage and file systems

-Extending applicability of checkpoint

-Replication (backup core) <--> Rollback recovery

-Fault recovery <--> Fault avoidance (migration)

-Transparent <--> Application guided

-Resilience and programming/execution models

-Resilient apps. & algo. possibly with OS support

-Language / compiler support for resilience

-Experimental env. to stress & compare solutions

Barriers and gaps:

-Cope with a continuous flow of different errors and faults, including soft errors (silent or not)

-Current techniques (checkpoint/restart) will not scale

-Limits of storage and file systems

-Software stack is not fault aware

-Provide verification of large, long-time-scale simulations

Better consistency in error/fault management across software layers [S]

System + Application interactions to manage errors and faults [M]

Naturally resilient system software [M] and application software [L]

-Resilience is a key issue for the Exascale community

-Enable tightly coupled applications to run longer

-Better resilience will provide better efficiency (full system)

Summary of research direction

Potential impact on software component

Potential impact on usability, capability, and breadth of community

4.x Resilience

Resilience is a critical issue for achieving high application throughput

[Figure: resilience roadmap, 2010 to 2019. Vertical axis: net throughput. System scale milestones: 10 Peta, 100 Peta, 1 Exa.]

All software should be fault aware and consistent

Fault oblivious Applications

Applications should be able to dynamically handle errors

Extend applicability of checkpointing
-- IO caching (e.g., NAND)
-- New FT protocols

System level fault-tolerance
-- prediction for time optimal checkpointing and migration
-- isolation and local recovery/management

Improved hardware and software reliability
-- better RAS collection and analysis (root cause)
-- Integration

Long Medium Short

MTBF=<1h MTBF=<10m MTBF=<10h (based on DARPA report)

MTBF=day

MTBF=10h MTBF=<1h
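
The MTBF figures above can be related to checkpoint/restart overhead with Young's first-order approximation for the checkpoint interval, interval ≈ sqrt(2 × C × MTBF), where C is the time to write one checkpoint. The sketch below is illustrative only: the 10-minute checkpoint cost is an assumption, not a figure from this roadmap, and the function names are hypothetical. The model ignores restart time and is only meaningful while C is much smaller than the MTBF.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order estimate of the compute time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def waste_fraction(checkpoint_cost_s, mtbf_s, interval_s):
    """First-order fraction of time lost to checkpointing and rework.

    Two terms: the checkpoint cost paid once per interval, and an expected
    rework of half an interval per failure. Restart time is ignored.
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    checkpoint_cost_s = 600.0  # ASSUMPTION: 10 minutes to write one checkpoint
    # MTBF values taken from the roadmap labels: one day, 10 h, 1 h.
    for label, mtbf_s in [("1 day", 86400.0), ("10 h", 36000.0), ("1 h", 3600.0)]:
        interval_s = young_interval(checkpoint_cost_s, mtbf_s)
        waste = waste_fraction(checkpoint_cost_s, mtbf_s, interval_s)
        print(f"MTBF={label}: interval={interval_s / 60.0:.1f} min, waste={waste:.1%}")
```

Under the assumed 10-minute checkpoint cost, the estimated waste rises from roughly 12% at MTBF = 1 day to roughly 58% at MTBF = 1 h, which is one way to read the statement that current checkpoint/restart techniques will not scale.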

Terms: Fault Repair, Fault Avoidance

4.x Resilience

• Technology drivers

-Increase in the number and variety of errors

-Huge increase in the number of components and threads, power management, new hardware (flash memory, accelerators)

-Increase of the data size, limit of centralized I/O, higher potential bandwidth of local storage.

• Alternative R&D strategies

-Fault recovery <--> Fault avoidance (migration)

-Transparent <--> Application directed

-Replication (backup core) <--> Rollback Recovery (replicate locally and restart globally?)

• Recommended research agenda

-Fault understanding (RAS analysis), modeling, prediction [S-M]

-Fault isolation/confinement + local management [M]

-Virtualization [S]

-Extending the applicability of Rollback recovery (reducing ckpt size, caching, scalable FT protocols) [S]

-Resilient Storage and file systems [S-L]

-Resilience and programming/execution models (MW, Map Reduce, Transactions) [M-L]

-Language / compiler support for resilience [M]

-Resilient apps & algorithms (forward recovery, NFTA, ABFT) possibly with OS support [L]

-Experimental environment to stress envisioned solutions [M]
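
As an illustration of the ABFT item in the agenda above, the sketch below uses the classic checksum-encoded matrix multiplication: a column-checksum row is appended to A and a row-checksum column to B, so the product carries checksums that locate and repair a single corrupted element without rollback. This is a minimal sketch and not a method prescribed by this roadmap; the function names are hypothetical, and integer data is assumed so that checksum comparisons are exact (floating-point data would need a tolerance).

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]


def locate_error(c_full):
    """Return (row, col) of a single corrupted data element, or None if clean."""
    n = len(c_full) - 1      # number of data rows
    m = len(c_full[0]) - 1   # number of data columns
    bad_rows = [i for i in range(n) if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("error pattern is not a single corrupted data element")


def correct_error(c_full, row, col):
    """Rebuild one data element from its row checksum (forward recovery)."""
    m = len(c_full[0]) - 1
    others = sum(c_full[row][j] for j in range(m) if j != col)
    c_full[row][col] = c_full[row][m] - others


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c_full = matmul(add_column_checksum(a), add_row_checksum(b))
    c_full[1][0] = 99  # simulate a silent soft error in one result element
    position = locate_error(c_full)
    correct_error(c_full, *position)
    print(position, [row[:-1] for row in c_full[:-1]])
```

The design choice here is forward recovery: the corrupted value is recomputed from redundant information carried by the computation itself, rather than restored from a checkpoint.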

• Crosscutting considerations

-Resilience <--> power management, performance (fault free situation and when faults occur)

-Scalability, programmability
