
Steal Tree: Low-Overhead Tracing of Work Stealing Schedulers
Jonathan Lifflander (UIUC), Sriram Krishnamoorthy (PNNL), Laxmikant V. Kale (UIUC)

[Figure: Annotated Example Program with snapshots of the Help-first Scheduling Policy and the Work-first Scheduling Policy after the 1st, 2nd, and 3rd steal. Each snapshot shows the Current Execution State, the Deque, and the Steal Tree (after the steal); deque entries and steal-tree nodes (c1, c2, ...) carry level:step labels such as 1:0 and 2:1, together with "steal end" and "local end" markers.]

Results

[Figure: Execution Time Overhead for AllQueens, Heat, Fib, FFT, Strassen, NBody, Cholesky, LU, and Matmul on 2 to 120 threads, and for AQ-HF/WF, SCF-HF/WF, TCE-HF/WF, and PG-HF/WF on 2000 to 32000 cores.]

[Figure: Storage Overhead in KB/core for AQ, SCF, TCE, and PG under the help-first (HF) and work-first (WF) policies on 2000 to 32000 cores, and in KB/thread for AllQueens, Heat, Fib, FFT, Strassen, NBody, Cholesky, LU, and Matmul on 2 to 120 threads.]

[Figure: Percent Reduction in DPST Traversals for AllQueens, LU, Heat, NBody, and Matmul on 2 to 120 threads; Utilization Plot of Cilk LU.]

Introduction

Work stealing is a popular approach to scheduling task-parallel programs, used in many programming models including OpenMP 3.0, Java Concurrency Utilities, Intel TBB, Cilk, and X10. The flexibility inherent in work stealing when dealing with load imbalance results in seemingly irregular computation structures, complicating the study of its runtime behavior. We present an approach to efficiently trace async-finish parallel programs scheduled using work stealing.

Tracing Algorithms

Rather than trace individual tasks, we exploit the structure of work stealing schedulers to coarsen the events traced. In particular, we construct a steal tree: a tree of steal operations that partitions the program execution into groups of tasks. We identify the key properties of two scheduling policies that enable the steal tree to be compactly represented.
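As a concrete illustration of what such a trace can look like, the following is a minimal sketch in C++ of a steal-tree node, assuming each steal is identified by the thief and by the (level, step) position at which the victim's work was split, mirroring the level:step labels in the snapshots above. The names and layout are illustrative, not the data structure used in the paper.

    #include <vector>

    // One node per steal: which worker stole, and where in the victim's
    // execution (async/finish nesting level and step within that level) the
    // steal happened. The level:step pair mirrors the 1:0, 2:1, ... labels.
    struct StealNode {
        unsigned thief;                    // id of the worker that performed the steal
        unsigned level;                    // nesting level at which the victim was split
        unsigned step;                     // step index within that level
        std::vector<StealNode*> children;  // steals performed out of this stolen subtree
    };

    // Record a steal out of the subtree rooted at `victim`; the root of the
    // tree represents the initial task of the program.
    StealNode* record_steal(StealNode* victim, unsigned thief,
                            unsigned level, unsigned step) {
        StealNode* node = new StealNode{thief, level, step, {}};
        victim->children.push_back(node);
        return node;
    }

Because only steals are recorded rather than individual tasks, the trace grows with the number of steals, which is what keeps the per-core trace sizes in the kilobyte range reported in the Storage Overhead plots.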

Primary Contributions

- Identifying the key properties of work-first and help-first schedulers to compactly represent the stealing relationships.
- Algorithms that exploit these properties to trace and replay async-finish programs by efficiently constructing the steal tree (see the sketch after the Applications list).
- Demonstration of low space overheads and within-variance perturbation of execution in tracing work stealing schedulers.
- Applying the tracing to two distinct contexts: data-race detection and retentive stealing.

Applications

- We show a substantial reduction in the cost of data race detection using an algorithm that exploits our traces to maintain and traverse the dynamic program structure tree (DPST). In some cases, we observe up to a 70% reduction in tree traversals.
- Using our traces, we are able to reduce the memory requirements for retentive stealing by not explicitly enumerating tasks.
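To make the replay contribution concrete, here is a hedged sketch of the kind of check a replaying worker could perform against the recorded steal tree: on reaching a (level, step) point within a subtree, it asks whether that point was stolen in the traced run and, if so, hands the corresponding child subtree to the worker that originally stole it. The function and field names are hypothetical; the paper's replay algorithms use policy-specific encodings for help-first and work-first scheduling.

    #include <vector>

    // Same node shape as in the previous sketch.
    struct StealNode {
        unsigned thief, level, step;
        std::vector<StealNode*> children;
    };

    // Hypothetical replay check: was the point (level, step) inside the
    // subtree owned by `node` stolen in the traced run? If so, the matching
    // child subtree is shipped to the worker that originally stole it.
    const StealNode* stolen_here(const StealNode* node, unsigned level, unsigned step) {
        for (const StealNode* child : node->children)
            if (child->level == level && child->step == step)
                return child;   // replay: hand this subtree to worker child->thief
        return nullptr;         // not stolen in the traced run: execute locally
    }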

Presentation on Wednesday (June 19th) at 3:30pm in the Seattle 2&3 room

Annotated example program fn() (referenced by the snapshots above):

fn() {
  s1;
  async {
    s5;
    async w;
    s6;
  }
  s2;
  finish {
    s7;
    async x;
    s8;
    async y;
    s9;
    async z;
    s10;
  }
  s3;
  async { s11; }
  s4;
}
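The level:step labels attached to fn()'s statements in the snapshots (s1 at 1:0, the steps inside the finish block at 2:0, 2:1, 2:2) suggest a simple numbering convention: the level grows when execution enters a nested async or finish scope, and the step index at a level advances each time an async splits off a new continuation. The following C++ fragment is a hypothetical illustration of that convention only; the exact label assignment in the figures and in the authors' implementation may differ.

    #include <cstdio>

    // Hypothetical reading of the level:step labels (1:0, 2:1, ...) used in
    // the snapshots. All names below are illustrative.
    struct Label { unsigned level, step; };

    // The continuation left behind after spawning an async advances the step.
    Label after_async(Label l)  { return {l.level, l.step + 1}; }
    // Entering a nested async or finish scope starts a new level at step 0.
    Label nested_scope(Label l) { return {l.level + 1, 0}; }

    int main() {
        Label cur{1, 0};                  // s1 starts at 1:0, as labelled in the figure
        cur = after_async(cur);           // continuation after the first async: 1:1
        Label inner = nested_scope(cur);  // first step inside the finish block: 2:0
        inner = after_async(inner);       // continuation after async x: 2:1
        std::printf("outer=%u:%u inner=%u:%u\n",
                    cur.level, cur.step, inner.level, inner.step);
        return 0;
    }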
