
Steal Tree: Low-Overhead Tracing of Work Stealing Schedulers
Jonathan Lifflander (UIUC), Sriram Krishnamoorthy (PNNL), Laxmikant V. Kale (UIUC)

[Figure: Annotated Example Program with snapshots of the Help-first Scheduling Policy and the Work-first Scheduling Policy after the 1st, 2nd, and 3rd steal. Each snapshot shows the Current Execution State, the Deque, and the Steal Tree (after the steal); deque entries and steal-tree nodes (c1, c2, ...) carry level:step labels such as 1:0 and 2:1, together with "steal end" and "local end" markers.]

Results

[Figure: Execution Time Overhead for AllQueens, Heat, Fib, FFT, Strassen, NBody, Cholesky, LU, and Matmul on 2 to 120 threads, and for AQ-HF/WF, SCF-HF/WF, TCE-HF/WF, and PG-HF/WF on 2000 to 32000 cores.]

[Figure: Storage Overhead in KB/core for AQ, SCF, TCE, and PG under the help-first (HF) and work-first (WF) policies on 2000 to 32000 cores, and in KB/thread for AllQueens, Heat, Fib, FFT, Strassen, NBody, Cholesky, LU, and Matmul on 2 to 120 threads.]

[Figure: Percent Reduction in DPST Traversals for AllQueens, LU, Heat, NBody, and Matmul on 2 to 120 threads; Utilization Plot of Cilk LU.]

Introduction

Work stealing is a popular approach to scheduling task-parallel programs, used in many programming models including OpenMP 3.0, Java Concurrency Utilities, Intel TBB, Cilk, and X10. The flexibility inherent in work stealing when dealing with load imbalance results in seemingly irregular computation structures, complicating the study of its runtime behavior. We present an approach to efficiently trace async-finish parallel programs scheduled using work stealing.

Tracing Algorithms

Rather than trace individual tasks, we exploit the structure of work stealing schedulers to coarsen the events traced. In particular, we construct a steal tree: a tree of steal operations that partitions the program execution into groups of tasks. We identify the key properties of two scheduling policies that enable the steal tree to be compactly represented.
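As a concrete illustration of what such a trace can look like, the following is a minimal sketch in C++ of a steal-tree node, assuming each steal is identified by the thief and by the (level, step) position at which the victim's work was split, mirroring the level:step labels in the snapshots above. The names and layout are illustrative, not the data structure used in the paper.

    #include <vector>

    // One node per steal: which worker stole, and where in the victim's
    // execution (async/finish nesting level and step within that level) the
    // steal happened. The level:step pair mirrors the 1:0, 2:1, ... labels.
    struct StealNode {
        unsigned thief;                    // id of the worker that performed the steal
        unsigned level;                    // nesting level at which the victim was split
        unsigned step;                     // step index within that level
        std::vector<StealNode*> children;  // steals performed out of this stolen subtree
    };

    // Record a steal out of the subtree rooted at `victim`; the root of the
    // tree represents the initial task of the program.
    StealNode* record_steal(StealNode* victim, unsigned thief,
                            unsigned level, unsigned step) {
        StealNode* node = new StealNode{thief, level, step, {}};
        victim->children.push_back(node);
        return node;
    }

Because only steals are recorded rather than individual tasks, the trace grows with the number of steals, which is what keeps the per-core trace sizes in the kilobyte range reported in the Storage Overhead plots.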

Primary Contributions

- Identifying the key properties of work-first and help-first schedulers to compactly represent the stealing relationships.
- Algorithms that exploit these properties to trace and replay async-finish programs by efficiently constructing the steal tree (see the sketch after the Applications list).
- Demonstration of low space overheads and within-variance perturbation of execution in tracing work stealing schedulers.
- Applying the tracing to two distinct contexts: data-race detection and retentive stealing.

Applications

- We show a substantial reduction in the cost of data race detection using an algorithm that exploits our traces to maintain and traverse the dynamic program structure tree (DPST). In some cases, we observe up to a 70% reduction in tree traversals.
- Using our traces, we are able to reduce the memory requirements for retentive stealing by not explicitly enumerating tasks.
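To make the replay contribution concrete, here is a hedged sketch of the kind of check a replaying worker could perform against the recorded steal tree: on reaching a (level, step) point within a subtree, it asks whether that point was stolen in the traced run and, if so, hands the corresponding child subtree to the worker that originally stole it. The function and field names are hypothetical; the paper's replay algorithms use policy-specific encodings for help-first and work-first scheduling.

    #include <vector>

    // Same node shape as in the previous sketch.
    struct StealNode {
        unsigned thief, level, step;
        std::vector<StealNode*> children;
    };

    // Hypothetical replay check: was the point (level, step) inside the
    // subtree owned by `node` stolen in the traced run? If so, the matching
    // child subtree is shipped to the worker that originally stole it.
    const StealNode* stolen_here(const StealNode* node, unsigned level, unsigned step) {
        for (const StealNode* child : node->children)
            if (child->level == level && child->step == step)
                return child;   // replay: hand this subtree to worker child->thief
        return nullptr;         // not stolen in the traced run: execute locally
    }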

Presentation on Wednesday (June 19th) at 3:30pm in the Seattle 2&3 room

Annotated example program fn() (referenced by the snapshots above):

fn() {
  s1;
  async {
    s5;
    async w;
    s6;
  }
  s2;
  finish {
    s7;
    async x;
    s8;
    async y;
    s9;
    async z;
    s10;
  }
  s3;
  async { s11; }
  s4;
}
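The level:step labels attached to fn()'s statements in the snapshots (s1 at 1:0, the steps inside the finish block at 2:0, 2:1, 2:2) suggest a simple numbering convention: the level grows when execution enters a nested async or finish scope, and the step index at a level advances each time an async splits off a new continuation. The following C++ fragment is a hypothetical illustration of that convention only; the exact label assignment in the figures and in the authors' implementation may differ.

    #include <cstdio>

    // Hypothetical reading of the level:step labels (1:0, 2:1, ...) used in
    // the snapshots. All names below are illustrative.
    struct Label { unsigned level, step; };

    // The continuation left behind after spawning an async advances the step.
    Label after_async(Label l)  { return {l.level, l.step + 1}; }
    // Entering a nested async or finish scope starts a new level at step 0.
    Label nested_scope(Label l) { return {l.level + 1, 0}; }

    int main() {
        Label cur{1, 0};                  // s1 starts at 1:0, as labelled in the figure
        cur = after_async(cur);           // continuation after the first async: 1:1
        Label inner = nested_scope(cur);  // first step inside the finish block: 2:0
        inner = after_async(inner);       // continuation after async x: 2:1
        std::printf("outer=%u:%u inner=%u:%u\n",
                    cur.level, cur.step, inner.level, inner.step);
        return 0;
    }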
