Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016
Sara S. Hamouda*, Sanjay Chatterjee**, Nick Vrvilo**, Zoran Budimlic**, Vivek Sarkar**
*The Australian National University, **Rice University
1
Towards Resilient CnC-OCR
CnC-OCR Framework
CnC-OCR Framework
CnC Graph Tuning
Skeleton CnC Program
OCR APIs
CnC Programmer edit
CnC Programmer
2
Open Community Runtime
} run
Reliability of Large Scale Systems
3
• As HPC systems grow larger, their failure rates increase[1].
• Exa-scale systems are expected to have mean time between failures (MTBF) in terms of minutes[2].
[1] B. Schroeder and G. A. Gibson, “Understanding failures in petascale computers,” in Journal of Physics: Conference Series, vol. 78, no. 1. IOP Publishing, 2007 [2] G. Zheng, L. Shi, and L. V. Kalé, “FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI,” in IEEE International Conference on Cluster Computing, pp. 93–103, IEEE, 2004.
Problem• Resilience support is missing from all CnC and OCR
implementations.
4
Problem• Resilience support is missing from all CnC and OCR
implementations.
• Objective: allow large scale CnC-OCR applications to efficiently recover from process failures.
5
Problem• Resilience support is missing from all CnC and OCR
implementations.
• Objective: allow large scale CnC-OCR applications to efficiently recover from process failures.
• Solution:
• Extend OCR with user-level fault tolerance.
• Implement failure recovery algorithm for CnC-OCR applications using OCR fault tolerance support
6
Agenda• OCR User-Level Fault Tolerance
๏ Failure Detection
๏ Failure Propagation
• CnC-OCR Application Recovery
7
User-Level Fault Tolerance
Application
Runtime
Transport LayerFailed processes
Failed tasks Lost data
New tasks and dependence configurations
8
User-Level Fault Tolerance1) Failure Detection
Application
OCR
OCR Transport LayerFailed processes
Failed tasks Lost data
New tasks and dependence configurations
MPI GASNet
User-Level Fault Tolerance1) Failure Detection
Application
OCR
OCR Transport LayerFailed processes
Failed tasks Lost data
New tasks and dependence configurations
MPI GASNet
MPI-ULFM by the MPI Forum
GASNet-Exby LBNL
User-Level Fault Tolerance
Application
OCR
OCR Transport LayerFailed processes
Failed steps Lost item-collections
New steps and dependence configurations
11
MPI-ULFM
1) Failure Detection
MPI User-Level Failure Mitigation
12
MPI User-Level Failure Mitigation
MPI-3
Specified runtime behavior after process failure
New MPI Operations
13
MPI User-Level Failure Mitigation• Assumes fail-stop process failure.
• Allows non-failed processes to communicate.
• Failure reporting via special error codes.
14
MPI User-Level Failure Mitigation
15
• New MPI Operations:
๏ MPI_Comm_failure_get_acked
๏ MPI_Comm_revoke
๏ MPI_Comm_shrink
๏ MPI_Comm_agree
MPI User-Level Failure Mitigation
MPI_Comm_failure_get_acked(*comm, &failed_group);
B
failed_group
16
CBA D• New MPI Operations:
๏ MPI_Comm_failure_get_acked
๏ MPI_Comm_revoke
๏ MPI_Comm_shrink
๏ MPI_Comm_agree
MPI User-Level Failure Mitigation
17
• New MPI Operations:
๏ MPI_Comm_failure_get_acked
๏ MPI_Comm_revoke
๏ MPI_Comm_shrink
๏ MPI_Comm_agree
CBA D
MPI User-Level Failure Mitigation
18
• New MPI Operations:
๏ MPI_Comm_failure_get_acked
๏ MPI_Comm_revoke
๏ MPI_Comm_shrink
๏ MPI_Comm_agree
CBA D
MPI_Comm_Revoke(comm);
CBA D
MPI User Level Failure Mitigation
19
• New MPI Operations:
๏ MPI_Comm_failure_get_acked
๏ MPI_Comm_revoke
๏ MPI_Comm_shrink
๏ MPI_Comm_agree
CBA D
MPI_Comm_shrink(comm, newComm);
C DA
MPI User Level Failure Mitigation
20
• New MPI Operations:
๏ MPI_Comm_failure_get_acked
๏ MPI_Comm_revoke
๏ MPI_Comm_shrink
๏ MPI_Comm_agree
CBA D
error_code = MPI_Comm_agree(comm, vote);
vote=0 vote=1
vote=0
vote=1
OCR Failure Detection with ULFM• Failure detection:
๏ MPI_Isend / MPI_Irecv / MPI_Test
๏ MPI_Iprobe (MPI_ANY_SOURCE, …)
• MPI_COMM_WORLD error handler: ๏ Discovers dead processes (MPI_Comm_failure_get_acked)
๏ Does not shrink or revoke the communicator.
๏ Calls OCR->updateDeadLocations(…)
• Added OCR APIs: ๏ void updateDeadLocations(…)
๏ bool ocrIsLocationDead()
๏ u64 ocrGetDeadLocationsCount()
21
OCR Failure Detection with ULFM• Failure detection:
๏ MPI_Isend / MPI_Irecv / MPI_Test
๏ MPI_Iprobe (MPI_ANY_SOURCE, …)
• MPI_COMM_WORLD error handler: ๏ Discovers dead processes (MPI_Comm_failure_get_acked)
๏ Does not shrink or revoke the communicator.
๏ Calls OCR->updateDeadLocations(…)
• Added OCR APIs: ๏ void updateDeadLocations(…)
๏ bool ocrIsLocationDead()
๏ u64 ocrGetDeadLocationsCount()
22
OCR Failure Detection with ULFM• Failure detection:
๏ MPI_Isend / MPI_Irecv / MPI_Test
๏ MPI_Iprobe (MPI_ANY_SOURCE, …)
• MPI_COMM_WORLD error handler: ๏ Discovers dead processes (MPI_Comm_failure_get_acked)
๏ Does not shrink or revoke the communicator.
๏ Calls OCR->updateDeadLocations(…)
• Added OCR APIs: ๏ void updateDeadLocations(…)
๏ bool ocrIsLocationDead()
๏ u64 ocrGetDeadLocationsCount()
23
User-Level Fault Tolerance
Application
OCR
OCR Transport LayerFailed processes
Failed steps Lost item-collections
New steps and dependence configurations
24
MPI-ULFM
2) Failure Propagation
Example OCR Dependence Graph
Task3
Task2
25
EV1
Task1
EV2
Task3
Task2Task1
Remote DependenceP1 P2
P3
EV1 EV2
Task3
Task2Task1
Remote DependenceP1 P2
P3
EV1 EV2
Detecting Lost Dependence• When adding a dependence on a remote event, add a
local proxy event
28
Non ResilientP2
Task1
P1
P2
Task1
EV1
EV1_proxy
P1
Resilient
EV1
P2
Detecting Lost Dependence
29
Non ResilientP2
Task1
P1
P2
Task1
EV1
EV1_proxy
P1
Resilient
EV1
P2
Error Code
• When adding a dependence on a remote event, add a local proxy event
User-Level Fault Tolerance
Application
OCR
OCR Transport LayerFailed processes
Failed steps Lost item-collections
New steps and dependence configurations
30
MPI-ULFM
3) Application Recovery
Monte Carlo PI
finalTask:
PI = 4 * all_points_inside_circle / all_points
worker:1.generate random points2.count all generated points3.count points inside circle4.send counters to accumulator
finalTask
worker worker worker worker
P0
P2P1
31
Monte Carlo PI
finalTask:
PI = 4 * all_points_inside_circle / all_points
worker:1.generate random points2.count all generated points3.count points inside circle4.send counters to accumulator
32
finalTask
worker worker worker worker
P0
P2P1
Monte Carlo PI
finalTask:
PI = 4 * all_points_inside_circle / all_points
worker:1.generate random points2.count all generated points3.count points inside circle4.send counters to accumulator
33
finalTask
worker worker worker worker
P0
P2P1
Monte Carlo PI
finalTask:
PI = 4 * all_points_inside_circle / all_points
worker:1.generate random points2.count all generated points3.count points inside circle4.send counters to accumulator
34
finalTask
worker worker worker worker
P0
P2P1
Error Code
1 void finalTask(cncTag_t i, WorkerData *data1, WorkerData*data2, WorkerData *data3, WorkerData *data4) 2 {3 WorkerData* dep[4] = {data1, data2, data3, data4};4 u32 nPoints = 0;5 u32 nWorkers = 0;6 float PI = 0.0;7 for (int i = 0 ; i < 4; i++) {8 if (dep[i].valid) {9 nPoints += dep[i].points;10 nWorkers ++;11 }12 }13 PI = 4.0f * nPoints / (WORKER_SAMPLES_COUNT * nWorkers) ;14 PRINTF("PI equals %f \n" , PI);15 }
Monte Carlo PI
35
General Application Recovery• How to reconstruct the lost portion of the graph?
• How to manage data versions?
36
CnC-OCR Applications• How to reconstruct the lost portion of the graph? ‣ CnC-OCR has a global knowledge of the whole
graph structure ‣ CnC-OCR distribution tunings provide hints about
task and data locations
• How to manage data versions? ‣ CnC-OCR uses single-assignment
37
CnC Smith-Waterman
( swStep: i, j ) <- [ above: i, j ] $when(i > 0), [ left: i, j ] $when(j > 0) -> [ above: i+1, j ], [ left: i, j+1 ], ( swStep: i+i, j ) $when(i+1 < #rows);
Graph Specification
swStep(0,0)
Smith-Waterman
39
swStep(0,2)swStep(0,1)
swStep(1,0)
swStep(2,0)
swStep(1,1) swStep(1,2)
swStep(2,1) swStep(2,2)
CnC Smith-Waterman
( swStep: i, j ) <- [ above: i, j ] $when(i > 0), [ left: i, j ] $when(j > 0) -> [ above: i+1, j ], [ left: i, j+1 ], ( swStep: i+i, j ) $when(i+1 < #rows);
Graph Specification
( swStep ): { distfn: ( i ) % $RANKS };
Tuning Specification (Row-Block Distribution)
swStep(0,0)
CnC Smith-Waterman
41
swStep(0,2)swStep(0,1)
swStep(1,0)
swStep(2,0)
swStep(1,1) swStep(1,2)
swStep(2,1) swStep(2,2)
P0
P1
P2
swStep(0,0)
CnC Smith-Waterman
42
swStep(0,2)swStep(0,1)
swStep(1,0)
swStep(2,0)
swStep(1,1) swStep(1,2)
swStep(2,1) swStep(2,2)
P0
P1
P2
above(1,0)
left(0,1)
swStep(0,0)
CnC Smith-Waterman
43
swStep(0,2)swStep(0,1)
swStep(1,0)
swStep(2,0)
swStep(1,1) swStep(1,2)
swStep(2,1) swStep(2,2)
P0
P1
P2
above(1,0)
left(0,1)
above(2,0)
left(1,1)
above(1,1)
left(0,2)
swStep(0,0)
CnC Smith-Waterman
44
swStep(0,2)swStep(0,1)
swStep(1,0)
swStep(2,0)
swStep(1,1) swStep(1,2)
swStep(2,1) swStep(2,2)
P0
P1
P2
above(1,0)
left(0,1)
above(2,0)
left(1,1)
above(1,1)
left(0,2)
left(2,1)
above(2,1)
left(1,2)
above(1,2)
swStep(0,0)
CnC Smith-Waterman
45
swStep(0,2)swStep(0,1)
swStep(1,0)
swStep(2,0)
swStep(1,1) swStep(1,2)
swStep(2,1) swStep(2,2)
P0
P1
P2
above(1,0)
left(0,1)
above(2,0)
left(1,1)
above(1,1)
left(0,2)
left(2,1)
above(2,1)
left(1,2)
above(1,2)
above(2,2)
left(2,2)
swStep(0,0)
CnC Smith-Waterman
46
swStep(0,2)swStep(0,1)
swStep(1,0)
swStep(2,0)
swStep(1,1) swStep(1,2)
swStep(2,1) swStep(2,2)
P0
P1
P2
above(1,0)
left(0,1)
above(2,0)
left(1,1)
above(1,1)
left(0,2)
left(2,1)
above(2,1)
left(1,2)
above(1,2)
above(2,2)
left(2,2)
swStep(0,0)
CnC Smith-Waterman
47
swStep(0,2)swStep(0,1)
swStep(1,0)
swStep(2,0)
swStep(1,1) swStep(1,2)
swStep(2,1) swStep(2,2)
P0
P1
P2
swStep(0,0)
CnC Smith-Waterman
48
swStep(0,2)swStep(0,1)
swStep(1,0)
swStep(2,0)
swStep(1,1) swStep(1,2)
swStep(2,1) swStep(2,2)
P0
P1
P2
Error Code Error Code Error Code
swStep(0,0)
Smith-Waterman
49
swStep(1,0)
P0
P1
Error Code
P3
P2
_swStep(0,0)
_swStep(1,0)
swStep(2,0) swStep(2,1) swStep(2,2)
swStep(0,0)
Smith-Waterman
50
swStep(1,0)
P0
P1
Error Code
P3
P2
_swStep(0,0)
_swStep(1,0)
swStep(2,0) swStep(2,1) swStep(2,2)
swStep(0,0)
Smith-Waterman
51
swStep(1,0)
P0
P1
Error Code
P3
P2
_swStep(0,0)
_swStep(1,0)
swStep(2,0) swStep(2,1) swStep(2,2)
left(0,1)
above(1,0)
swStep(0,0)
Smith-Waterman
52
swStep(1,0)
P0
P1
Error Code
P3
P2
_swStep(0,0)
_swStep(1,0)
swStep(2,0) swStep(2,1) swStep(2,2)
left(0,1)
above(1,0)
above(2,0)
left(1,1)
swStep(0,0)
CnC Smith-Waterman
53
swStep(0,2)swStep(0,1)
swStep(1,0)
swStep(2,0)
swStep(1,1) swStep(1,2)
swStep(2,1) swStep(2,2)
P0
P1
P2
Error Code Error Code Error Code
Conclusion• OCR with user level fault tolerance
๏ Failure detection: using MPI-ULFM ๏ Failure propagation: using local proxy event
• CnC-OCR single assignment, global dependence knowledge and tunings simplify application recovery
• Future Work: ๏ Performance Evaluation ๏ Support a more general subset of OCR
54