54
CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**, Nick Vrvilo**, Zoran Budimlic**, Vivek Sarkar** *The Australian National University, **Rice University 1 Towards Resilient CnC-OCR

Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016

Sara S. Hamouda*, Sanjay Chatterjee**, Nick Vrvilo**, Zoran Budimlic**, Vivek Sarkar**

*The Australian National University, **Rice University

1

Towards Resilient CnC-OCR

Page 2: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

CnC-OCR Framework

CnC-OCR Framework

CnC Graph Tuning

Skeleton CnC Program

OCR APIs

CnC Programmer edit

CnC Programmer

2

Open Community Runtime

} run

Page 3: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

Reliability of Large Scale Systems

3

• As HPC systems grow larger, their failure rates increase[1].

• Exa-scale systems are expected to have mean time between failures (MTBF) in terms of minutes[2].

[1] B. Schroeder and G. A. Gibson, “Understanding failures in petascale computers,” in Journal of Physics: Conference Series, vol. 78, no. 1. IOP Publishing, 2007 [2] G. Zheng, L. Shi, and L. V. Kalé, “FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI,” in IEEE International Conference on Cluster Computing, pp. 93–103, IEEE, 2004.

Page 4: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

Problem• Resilience support is missing from all CnC and OCR

implementations.

4

Page 5: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

Problem• Resilience support is missing from all CnC and OCR

implementations.

• Objective: allow large scale CnC-OCR applications to efficiently recover from process failures.

5

Page 6: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

Problem• Resilience support is missing from all CnC and OCR

implementations.

• Objective: allow large scale CnC-OCR applications to efficiently recover from process failures.

• Solution:

• Extend OCR with user-level fault tolerance.

• Implement failure recovery algorithm for CnC-OCR applications using OCR fault tolerance support

6

Page 7: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

Agenda• OCR User-Level Fault Tolerance

๏ Failure Detection

๏ Failure Propagation

• CnC-OCR Application Recovery

7

Page 8: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

User-Level Fault Tolerance

Application

Runtime

Transport LayerFailed processes

Failed tasks Lost data

New tasks and dependence configurations

8

Page 9: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

User-Level Fault Tolerance1) Failure Detection

Application

OCR

OCR Transport LayerFailed processes

Failed tasks Lost data

New tasks and dependence configurations

MPI GASNet

Page 10: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

User-Level Fault Tolerance1) Failure Detection

Application

OCR

OCR Transport LayerFailed processes

Failed tasks Lost data

New tasks and dependence configurations

MPI GASNet

MPI-ULFM by the MPI Forum

GASNet-Exby LBNL

Page 11: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

User-Level Fault Tolerance

Application

OCR

OCR Transport LayerFailed processes

Failed steps Lost item-collections

New steps and dependence configurations

11

MPI-ULFM

1) Failure Detection

Page 12: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

MPI User-Level Failure Mitigation

12

Page 13: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

MPI User-Level Failure Mitigation

MPI-3

Specified runtime behavior after process failure

New MPI Operations

13

Page 14: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

MPI User-Level Failure Mitigation• Assumes fail-stop process failure.

• Allows non-failed processes to communicate.

• Failure reporting via special error codes.

14

Page 15: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

MPI User-Level Failure Mitigation

15

• New MPI Operations:

๏ MPI_Comm_failure_get_acked

๏ MPI_Comm_revoke

๏ MPI_Comm_shrink

๏ MPI_Comm_agree

Page 16: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

MPI User-Level Failure Mitigation

MPI_Comm_failure_get_acked(*comm, &failed_group);

B

failed_group

16

CBA D• New MPI Operations:

๏ MPI_Comm_failure_get_acked

๏ MPI_Comm_revoke

๏ MPI_Comm_shrink

๏ MPI_Comm_agree

Page 17: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

MPI User-Level Failure Mitigation

17

• New MPI Operations:

๏ MPI_Comm_failure_get_acked

๏ MPI_Comm_revoke

๏ MPI_Comm_shrink

๏ MPI_Comm_agree

CBA D

Page 18: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

MPI User-Level Failure Mitigation

18

• New MPI Operations:

๏ MPI_Comm_failure_get_acked

๏ MPI_Comm_revoke

๏ MPI_Comm_shrink

๏ MPI_Comm_agree

CBA D

MPI_Comm_Revoke(comm);

CBA D

Page 19: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

MPI User Level Failure Mitigation

19

• New MPI Operations:

๏ MPI_Comm_failure_get_acked

๏ MPI_Comm_revoke

๏ MPI_Comm_shrink

๏ MPI_Comm_agree

CBA D

MPI_Comm_shrink(comm, newComm);

C DA

Page 20: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

MPI User Level Failure Mitigation

20

• New MPI Operations:

๏ MPI_Comm_failure_get_acked

๏ MPI_Comm_revoke

๏ MPI_Comm_shrink

๏ MPI_Comm_agree

CBA D

error_code = MPI_Comm_agree(comm, vote);

vote=0 vote=1

vote=0

vote=1

Page 21: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

OCR Failure Detection with ULFM• Failure detection:

๏ MPI_Isend / MPI_Irecv / MPI_Test

๏ MPI_Iprobe (MPI_ANY_SOURCE, …)

• MPI_COMM_WORLD error handler: ๏ Discovers dead processes (MPI_Comm_failure_get_acked)

๏ Does not shrink or revoke the communicator.

๏ Calls OCR->updateDeadLocations(…)

• Added OCR APIs: ๏ void updateDeadLocations(…)

๏ bool ocrIsLocationDead()

๏ u64 ocrGetDeadLocationsCount()

21

Page 22: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

OCR Failure Detection with ULFM• Failure detection:

๏ MPI_Isend / MPI_Irecv / MPI_Test

๏ MPI_Iprobe (MPI_ANY_SOURCE, …)

• MPI_COMM_WORLD error handler: ๏ Discovers dead processes (MPI_Comm_failure_get_acked)

๏ Does not shrink or revoke the communicator.

๏ Calls OCR->updateDeadLocations(…)

• Added OCR APIs: ๏ void updateDeadLocations(…)

๏ bool ocrIsLocationDead()

๏ u64 ocrGetDeadLocationsCount()

22

Page 23: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

OCR Failure Detection with ULFM• Failure detection:

๏ MPI_Isend / MPI_Irecv / MPI_Test

๏ MPI_Iprobe (MPI_ANY_SOURCE, …)

• MPI_COMM_WORLD error handler: ๏ Discovers dead processes (MPI_Comm_failure_get_acked)

๏ Does not shrink or revoke the communicator.

๏ Calls OCR->updateDeadLocations(…)

• Added OCR APIs: ๏ void updateDeadLocations(…)

๏ bool ocrIsLocationDead()

๏ u64 ocrGetDeadLocationsCount()

23

Page 24: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

User-Level Fault Tolerance

Application

OCR

OCR Transport LayerFailed processes

Failed steps Lost item-collections

New steps and dependence configurations

24

MPI-ULFM

2) Failure Propagation

Page 25: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

Example OCR Dependence Graph

Task3

Task2

25

EV1

Task1

EV2

Page 26: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

Task3

Task2Task1

Remote DependenceP1 P2

P3

EV1 EV2

Page 27: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

Task3

Task2Task1

Remote DependenceP1 P2

P3

EV1 EV2

Page 28: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

Detecting Lost Dependence• When adding a dependence on a remote event, add a

local proxy event

28

Non ResilientP2

Task1

P1

P2

Task1

EV1

EV1_proxy

P1

Resilient

EV1

P2

Page 29: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

Detecting Lost Dependence

29

Non ResilientP2

Task1

P1

P2

Task1

EV1

EV1_proxy

P1

Resilient

EV1

P2

Error Code

• When adding a dependence on a remote event, add a local proxy event

Page 30: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

User-Level Fault Tolerance

Application

OCR

OCR Transport LayerFailed processes

Failed steps Lost item-collections

New steps and dependence configurations

30

MPI-ULFM

3) Application Recovery

Page 31: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

Monte Carlo PI

finalTask:

PI = 4 * all_points_inside_circle / all_points

worker:1.generate random points2.count all generated points3.count points inside circle4.send counters to accumulator

finalTask

worker worker worker worker

P0

P2P1

31

Page 32: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

Monte Carlo PI

finalTask:

PI = 4 * all_points_inside_circle / all_points

worker:1.generate random points2.count all generated points3.count points inside circle4.send counters to accumulator

32

finalTask

worker worker worker worker

P0

P2P1

Page 33: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

Monte Carlo PI

finalTask:

PI = 4 * all_points_inside_circle / all_points

worker:1.generate random points2.count all generated points3.count points inside circle4.send counters to accumulator

33

finalTask

worker worker worker worker

P0

P2P1

Page 34: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

Monte Carlo PI

finalTask:

PI = 4 * all_points_inside_circle / all_points

worker:1.generate random points2.count all generated points3.count points inside circle4.send counters to accumulator

34

finalTask

worker worker worker worker

P0

P2P1

Error Code

Page 35: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

1 void finalTask(cncTag_t i, WorkerData *data1, WorkerData*data2, WorkerData *data3, WorkerData *data4) 2 {3 WorkerData* dep[4] = {data1, data2, data3, data4};4 u32 nPoints = 0;5 u32 nWorkers = 0;6 float PI = 0.0;7 for (int i = 0 ; i < 4; i++) {8 if (dep[i].valid) {9 nPoints += dep[i].points;10 nWorkers ++;11 }12 }13 PI = 4.0f * nPoints / (WORKER_SAMPLES_COUNT * nWorkers) ;14 PRINTF("PI equals %f \n" , PI);15 }

Monte Carlo PI

35

Page 36: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

General Application Recovery• How to reconstruct the lost portion of the graph?

• How to manage data versions?

36

Page 37: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

CnC-OCR Applications• How to reconstruct the lost portion of the graph? ‣ CnC-OCR has a global knowledge of the whole

graph structure ‣ CnC-OCR distribution tunings provide hints about

task and data locations

• How to manage data versions? ‣ CnC-OCR uses single-assignment

37

Page 38: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

CnC Smith-Waterman

( swStep: i, j ) <- [ above: i, j ] $when(i > 0), [ left: i, j ] $when(j > 0) -> [ above: i+1, j ], [ left: i, j+1 ], ( swStep: i+i, j ) $when(i+1 < #rows);

Graph Specification

Page 39: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

swStep(0,0)

Smith-Waterman

39

swStep(0,2)swStep(0,1)

swStep(1,0)

swStep(2,0)

swStep(1,1) swStep(1,2)

swStep(2,1) swStep(2,2)

Page 40: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

CnC Smith-Waterman

( swStep: i, j ) <- [ above: i, j ] $when(i > 0), [ left: i, j ] $when(j > 0) -> [ above: i+1, j ], [ left: i, j+1 ], ( swStep: i+i, j ) $when(i+1 < #rows);

Graph Specification

( swStep ): { distfn: ( i ) % $RANKS };

Tuning Specification (Row-Block Distribution)

Page 41: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

swStep(0,0)

CnC Smith-Waterman

41

swStep(0,2)swStep(0,1)

swStep(1,0)

swStep(2,0)

swStep(1,1) swStep(1,2)

swStep(2,1) swStep(2,2)

P0

P1

P2

Page 42: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

swStep(0,0)

CnC Smith-Waterman

42

swStep(0,2)swStep(0,1)

swStep(1,0)

swStep(2,0)

swStep(1,1) swStep(1,2)

swStep(2,1) swStep(2,2)

P0

P1

P2

above(1,0)

left(0,1)

Page 43: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

swStep(0,0)

CnC Smith-Waterman

43

swStep(0,2)swStep(0,1)

swStep(1,0)

swStep(2,0)

swStep(1,1) swStep(1,2)

swStep(2,1) swStep(2,2)

P0

P1

P2

above(1,0)

left(0,1)

above(2,0)

left(1,1)

above(1,1)

left(0,2)

Page 44: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

swStep(0,0)

CnC Smith-Waterman

44

swStep(0,2)swStep(0,1)

swStep(1,0)

swStep(2,0)

swStep(1,1) swStep(1,2)

swStep(2,1) swStep(2,2)

P0

P1

P2

above(1,0)

left(0,1)

above(2,0)

left(1,1)

above(1,1)

left(0,2)

left(2,1)

above(2,1)

left(1,2)

above(1,2)

Page 45: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

swStep(0,0)

CnC Smith-Waterman

45

swStep(0,2)swStep(0,1)

swStep(1,0)

swStep(2,0)

swStep(1,1) swStep(1,2)

swStep(2,1) swStep(2,2)

P0

P1

P2

above(1,0)

left(0,1)

above(2,0)

left(1,1)

above(1,1)

left(0,2)

left(2,1)

above(2,1)

left(1,2)

above(1,2)

above(2,2)

left(2,2)

Page 46: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

swStep(0,0)

CnC Smith-Waterman

46

swStep(0,2)swStep(0,1)

swStep(1,0)

swStep(2,0)

swStep(1,1) swStep(1,2)

swStep(2,1) swStep(2,2)

P0

P1

P2

above(1,0)

left(0,1)

above(2,0)

left(1,1)

above(1,1)

left(0,2)

left(2,1)

above(2,1)

left(1,2)

above(1,2)

above(2,2)

left(2,2)

Page 47: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

swStep(0,0)

CnC Smith-Waterman

47

swStep(0,2)swStep(0,1)

swStep(1,0)

swStep(2,0)

swStep(1,1) swStep(1,2)

swStep(2,1) swStep(2,2)

P0

P1

P2

Page 48: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

swStep(0,0)

CnC Smith-Waterman

48

swStep(0,2)swStep(0,1)

swStep(1,0)

swStep(2,0)

swStep(1,1) swStep(1,2)

swStep(2,1) swStep(2,2)

P0

P1

P2

Error Code Error Code Error Code

Page 49: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

swStep(0,0)

Smith-Waterman

49

swStep(1,0)

P0

P1

Error Code

P3

P2

_swStep(0,0)

_swStep(1,0)

swStep(2,0) swStep(2,1) swStep(2,2)

Page 50: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

swStep(0,0)

Smith-Waterman

50

swStep(1,0)

P0

P1

Error Code

P3

P2

_swStep(0,0)

_swStep(1,0)

swStep(2,0) swStep(2,1) swStep(2,2)

Page 51: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

swStep(0,0)

Smith-Waterman

51

swStep(1,0)

P0

P1

Error Code

P3

P2

_swStep(0,0)

_swStep(1,0)

swStep(2,0) swStep(2,1) swStep(2,2)

left(0,1)

above(1,0)

Page 52: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

swStep(0,0)

Smith-Waterman

52

swStep(1,0)

P0

P1

Error Code

P3

P2

_swStep(0,0)

_swStep(1,0)

swStep(2,0) swStep(2,1) swStep(2,2)

left(0,1)

above(1,0)

above(2,0)

left(1,1)

Page 53: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

swStep(0,0)

CnC Smith-Waterman

53

swStep(0,2)swStep(0,1)

swStep(1,0)

swStep(2,0)

swStep(1,1) swStep(1,2)

swStep(2,1) swStep(2,2)

P0

P1

P2

Error Code Error Code Error Code

Page 54: Towards Resilient CnC-OCR · 2016-10-03 · CnC-2016: The 8th Annual Concurrent Collections Workshop Rochester, New York, 27 September 2016 Sara S. Hamouda*, Sanjay Chatterjee**,

Conclusion• OCR with user level fault tolerance

๏ Failure detection: using MPI-ULFM ๏ Failure propagation: using local proxy event

• CnC-OCR single assignment, global dependence knowledge and tunings simplify application recovery

• Future Work: ๏ Performance Evaluation ๏ Support a more general subset of OCR

54