
Priority Research Direction (use one slide for each)

Key challenges

-Fault understanding (RAS), modeling, prediction
-Fault isolation/confinement + local management
-Resilience <--> new tech (Flash mem, virtualization)
-Resilience <--> power consumption | heterogeneity
-Resilient Storage and file systems
-Extending applicability of checkpoint
-Replication (backup core) <--> Rollback recovery
-Fault recovery <--> Fault avoidance (migration)
-Transparent <--> Application guided
-Resilience and programming/execution models
-Resilient apps. & algo. possibly with OS support
-Language / compiler support for resilience
-Experimental env. to stress & compare solutions

Barriers and gaps:
-Cope with a continuous flow of different errors and faults, including soft errors (silent or not)
-Current techniques (checkpoint/restart) will not scale
-Limits of storage and file systems
-Software stack is not fault aware
-Provide verification of large, long-time-scale simulations

Better consistency in error/fault management across software layers [S]

System + Application interactions to manage errors and faults [M]

Naturally Resilient system [M] and application software [L]

-Resilience is a key issue for the Exascale community
-Enable tightly coupled applications to run longer
-Better resilience will provide better efficiency (full system)

Summary of research direction

Potential impact on software component

Potential impact on usability, capability, and breadth of community


4.x Resilience

Resilience is a critical issue for achieving high application throughput

[Figure: resilience roadmap, 2010 to 2019. Vertical axis: net throughput. System scale milestones: 10 Peta, 100 Peta, 1 Exa.]

All software should be fault aware and consistent

Fault oblivious Applications

Applications should be able to dynamically handle errors

Extend applicability of checkpointing
-- I/O caching (e.g., NAND)
-- New FT protocols

System level fault-tolerance
-- prediction for time optimal checkpointing and migration
-- isolation and local recovery/management

Improved hardware and software reliability
-- better RAS collection and analysis (root cause)
-- Integration

Time horizon legend: Long, Medium, Short

MTBF=<1h MTBF=<10m MTBF=<10h (based on DARPA report)

MTBF=day

MTBF=10h MTBF=<1h

Terms: Fault Repair, Fault Avoidance
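
The roadmap's "time optimal checkpointing" item and its MTBF figures can be tied together with a small worked example. The sketch below is not taken from the slides: it uses Young's classic first-order approximation, in which the compute time between checkpoints is sqrt(2 x checkpoint cost x MTBF), and a matching first-order estimate of the time lost to checkpoint writes and rework. The 60-second checkpoint cost is a purely illustrative assumption, and the approximation is only reasonable while the checkpoint cost is much smaller than the MTBF.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order optimal compute time between two checkpoints, in seconds."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def first_order_waste(checkpoint_cost_s, mtbf_s):
    """Approximate fraction of machine time lost when checkpointing at Young's interval.

    Sum of checkpoint overhead (cost / interval) and expected rework
    (interval / (2 * MTBF)), which simplifies to sqrt(2 * cost / MTBF).
    Capped at 1.0 because the approximation breaks down when the
    checkpoint cost approaches the MTBF.
    """
    return min(1.0, math.sqrt(2.0 * checkpoint_cost_s / mtbf_s))


# Assumption for illustration only: one checkpoint takes 60 seconds.
CHECKPOINT_COST_S = 60.0

# MTBF values that appear on the roadmap slide, converted to seconds.
MTBF_SCENARIOS_S = {
    "1 day": 86400.0,
    "10 h": 36000.0,
    "1 h": 3600.0,
    "10 min": 600.0,
}


def roadmap_table(checkpoint_cost_s=CHECKPOINT_COST_S):
    """Return (label, interval in seconds, waste fraction) for each MTBF scenario."""
    return [
        (label,
         young_interval(checkpoint_cost_s, mtbf_s),
         first_order_waste(checkpoint_cost_s, mtbf_s))
        for label, mtbf_s in MTBF_SCENARIOS_S.items()
    ]


if __name__ == "__main__":
    for label, interval_s, waste in roadmap_table():
        print(f"MTBF {label:>6}: checkpoint every {interval_s:8.1f} s, "
              f"estimated waste {waste:5.1%}")
```

Under these assumptions the estimated waste grows steadily as the MTBF shrinks from a day toward ten minutes, which is one way to read the claim that current checkpoint/restart techniques will not scale unless checkpoint cost drops (for example through local I/O caching) or faults are avoided through prediction and migration.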


4.x Resilience

• Technology drivers

-Increase in the number and variety of errors

-Huge increase in components and threads, power management, new hardware (flash memory, accelerators, etc.)

-Increase in data size, limits of centralized I/O, higher potential bandwidth of local storage

• Alternative R&D strategies

-Fault recovery <--> Fault avoidance (migration)

-Transparent <--> Application directed

-Replication (backup core) <--> Rollback Recovery (replicate locally and restart globally?)

• Recommended research agenda

-Fault understanding (RAS analysis), modeling, prediction [S-M]

-Fault isolation/confinement + local management [M]

-Virtualization [S]

-Extending the applicability of Rollback recovery (reducing ckpt size, caching, scalable FT protocols) [S]

-Resilient Storage and file systems [S-L]

-Resilience and programming/execution models (MW, Map Reduce, Transactions) [M-L]

-Language / compiler support for resilience [M]

-Resilient apps & algorithms (forward recovery, NFTA, ABFT) possibly with OS support [L]

-Experimental environment to stress envisioned solutions [M]
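
The ABFT (algorithm-based fault tolerance) entry in the agenda above can be illustrated with the textbook checksum scheme for matrix multiplication. The sketch below is an illustration only, not a method described in the slides: all function names are hypothetical, it uses integer matrices so that checksum comparisons are exact, and it handles only a single corrupted data element in the product.

```python
def matmul(a, b):
    """Plain triple-loop matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def with_column_checksums(a):
    """Append one row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksums(b):
    """Append one column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def checked_product(a, b):
    """Multiply the checksum-extended matrices.

    The result carries the product in its top-left block, plus a checksum
    column (row sums) and a checksum row (column sums).
    """
    return matmul(with_column_checksums(a), with_row_checksums(b))


def locate_single_error(c_full):
    """Return (row, col) of one corrupted data element, or None if all checksums agree."""
    n = len(c_full) - 1       # number of data rows
    m = len(c_full[0]) - 1    # number of data columns
    bad_rows = [i for i in range(n) if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("error pattern is not a single data-element fault")


def correct_single_error(c_full):
    """Repair a single corrupted data element in place using its row checksum."""
    location = locate_single_error(c_full)
    if location is not None:
        i, j = location
        m = len(c_full[0]) - 1
        sum_of_others = sum(c_full[i][:m]) - c_full[i][j]
        c_full[i][j] = c_full[i][m] - sum_of_others
    return c_full


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c = checked_product(a, b)
    c[0][1] += 5  # simulate a silent soft error in one result element
    print("corrupted element at", locate_single_error(c))
    correct_single_error(c)
    print("repaired product:", [row[:2] for row in c[:2]])
```

This is a form of forward recovery: the application detects and repairs the fault from redundant information it carries itself, without rolling back to a checkpoint. With floating-point data the equality tests would need a tolerance, which this sketch leaves out.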

• Crosscutting considerations

-Resilience <--> power management, performance (fault-free situation and when faults occur)

-scalability, programmability,