Priority Research Direction (use one slide for each)
Key challenges
-Fault understanding (RAS), modeling, prediction
-Fault isolation/confinement + local management
-Resilience <--> new tech (Flash mem, virtualization)
-Resilience <--> power consumption | heterogeneity
-Resilient Storage and file systems
-Extending applicability of checkpoint
-Replication (backup core) <--> Rollback recovery
-Fault recovery <--> Fault avoidance (migration)
-Transparent <--> Application guided
-Resilience and programming/execution models
-Resilient apps. & algo. possibly with OS support
-Language / compiler support for resilience
-Experimental env. to stress & compare solutions
Barriers and gaps:
-Cope with a continuous flow of different errors/faults, including soft errors (silent or not)
-Current techniques (ckpt/rest) will not scale
-Limits of Storage and file systems
-Software stack is not fault aware
-Provide Verification of large, long time scale simulation
Better consistency in error/fault management across software layers [S]
System + Application interactions to manage errors and faults [M]
Naturally Resilient system [M] and application software [L]
-Resilience is a key issue for the Exascale community
-Enable tightly coupled applications to run longer
-Better resilience will provide better efficiency (full system)
Summary of research direction
Potential impact on software component
Potential impact on usability, capability, and breadth of community
4.x Resilience
Resilience is a critical issue to achieve high application throughput
[Figure: roadmap, 2010 to 2019. Vertical axis: Net Throughput. System scale milestones: 10 Peta, 100 Peta, 1 Exa.]
All software should be fault aware and consistent
Fault oblivious Applications
Application should be able to dynamically handle errors
Extend applicability of checkpointing
-- IO caching (e.g., NAND)
-- New FT protocols
System level fault-tolerance
-- prediction for time optimal checkpointing and migration
-- isolation and local recovery/management
Improved hardware and software reliability
-- better RAS collection and analysis (root cause)
-- Integration
Time horizons: Long, Medium, Short
MTBF=<1h, MTBF=<10m, MTBF=<10h (based on DARPA report)
MTBF=day, MTBF=10h, MTBF=<1h
Terms: Fault Repair, Fault Avoidance
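The MTBF figures above suggest why classical checkpoint/restart is expected not to scale. A minimal sketch, assuming Young's first-order approximation for the optimal checkpoint interval, T = sqrt(2 * C * MTBF), and a purely hypothetical checkpoint cost C of 10 minutes (neither the formula nor the cost comes from the slides; restart time is ignored):

```python
import math

def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order approximation of the optimal checkpoint interval (seconds)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

def wasted_fraction(checkpoint_cost_s, mtbf_s):
    """First-order estimate of the fraction of machine time lost.

    waste ~= C/T (time spent writing checkpoints)
           + T/(2*MTBF) (expected rework after a failure),
    evaluated at T = young_interval(...). Clipped to 1.0, where the
    first-order model stops being meaningful.
    """
    t = young_interval(checkpoint_cost_s, mtbf_s)
    waste = checkpoint_cost_s / t + t / (2.0 * mtbf_s)
    return min(waste, 1.0)

# Hypothetical checkpoint cost: 10 minutes (an assumption for illustration only).
CHECKPOINT_COST_S = 600.0

# MTBF values taken from the roadmap: one day, 10 h, 1 h, 10 min.
MTBF_S = {"day": 86400.0, "10h": 36000.0, "1h": 3600.0, "10m": 600.0}

if __name__ == "__main__":
    for label, mtbf in MTBF_S.items():
        t = young_interval(CHECKPOINT_COST_S, mtbf)
        w = wasted_fraction(CHECKPOINT_COST_S, mtbf)
        print(f"MTBF={label:>3}: interval={t / 60.0:6.1f} min, waste={w:.0%}")
```

Under these assumptions the estimated waste grows as the MTBF shrinks, and the model saturates once the MTBF approaches the checkpoint cost. That is the regime where the directions on this slide (smaller checkpoints, IO caching, prediction, local recovery) would be needed.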
4.x Resilience
• Technology drivers
-Increase of the number of errors, variety of errors.
-Huge increase of components and threads, Power management, New hardware (Flash Mem., Accel., )
-Increase of the data size, limit of centralized I/O, higher potential bandwidth of local storage.
• Alternative R&D strategies
-Fault recovery <--> Fault avoidance (migration)
-Transparent <--> Application directed
-Replication (backup core) <--> Rollback Recovery (replicate locally and restart globally?)
• Recommended research agenda
-Fault understanding (RAS analysis), modeling, prediction [S-M]
-Fault isolation/confinement + local management [M]
-Virtualization [S]
-Extending the applicability of Rollback recovery (reducing ckpt size, caching, scalable FT protocols) [S]
-Resilient Storage and file systems [S-L]
-Resilience and programming/execution models (MW, Map Reduce, Transactions) [M-L]
-Language / compiler support for resilience [M]
-Resilient apps & algorithms (forward recovery, NFTA, ABFT) possibly with OS support [L]
-Experimental environment to stress envisioned solutions [M]
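To make the ABFT item in the agenda above concrete, here is a minimal sketch of checksum-based algorithm-based fault tolerance for matrix multiplication, in the style of the classic row/column checksum scheme. All function names are hypothetical, and integer arithmetic is assumed so that checksums compare exactly (floating-point data would need a tolerance):

```python
def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add_column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def add_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]

def locate_error(c_full):
    """Return (i, j) of a single corrupted data element, or None if all checksums agree."""
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n) if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("error pattern not correctable by this sketch")

def correct(c_full):
    """Repair a single corrupted data element in place (forward recovery, no rollback)."""
    loc = locate_error(c_full)
    if loc is not None:
        i, j = loc
        m = len(c_full[0]) - 1
        others = sum(c_full[i][k] for k in range(m) if k != j)
        c_full[i][j] = c_full[i][m] - others  # rebuild the element from the row checksum
    return loc

if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c = matmul(add_column_checksum(a), add_row_checksum(b))
    c[0][1] += 7  # inject a silent error into one data element
    print("corrected element at", correct(c))
    print("recovered product:", [row[:2] for row in c[:2]])
```

The product of the column-checksummed A and the row-checksummed B carries its own row and column checksums, so a single corrupted element can be located and rebuilt without restarting from a checkpoint. This sketch does not handle multiple errors or an error in a checksum entry itself.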
• Crosscutting considerations
-Resilience <--> power management, performance (fault free situation and when faults occur)
-scalability, programmability,