Annotated Example Program
Steal Tree: Low-Overhead Tracing of Work Stealing Schedulers
Jonathan Lifflander (UIUC), Sriram Krishnamoorthy (PNNL), Laxmikant V. Kale (UIUC)
[Figure: Help-first Scheduling Policy and Work-first Scheduling Policy. Snapshots at the 1st, 2nd, and 3rd steal, each showing the Current Execution State, the Deque (levels 1 to 3, with its steal end and local end), and the Steal Tree (after steal).]
Results
[Figure: Execution Time Overhead. AllQueens, Heat, Fib, FFT, Strassen, NBody, Cholesky, LU, and Matmul at 2, 4, 8, 16, 32, 64, 96, and 120 threads; AQ-HF, AQ-WF, SCF-HF, SCF-WF, TCE-HF, TCE-WF, PG-HF, and PG-WF at 2000, 4000, 8000, 16000, and 32000 cores.]
[Figure: Storage Overhead. KB/Core for AQ-HF, AQ-WF, SCF-HF, SCF-WF, TCE-HF, TCE-WF, PG-HF, and PG-WF at 2000 to 32000 cores; KB/Thread for AllQueens, Heat, Fib, FFT, Strassen, NBody, Cholesky, LU, and Matmul at 2 to 120 threads.]
[Figure: Percent Reduction DPST Traversals. AllQueens, LU, Heat, NBody, and Matmul at 2 to 120 threads.]
[Figure: Utilization Plot of Cilk LU.]
Introduction
Work stealing is a popular approach to scheduling task-parallel programs, used in many programming models including OpenMP 3.0, Java Concurrency Utilities, Intel TBB, Cilk, and X10. The flexibility inherent in work stealing when dealing with load imbalance results in seemingly irregular computation structures, complicating the study of its runtime behavior. We present an approach to efficiently trace async-finish parallel programs scheduled using work stealing.
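As a rough illustration of the mechanism the figures label "steal end" and "local end", the following minimal sketch models a per-worker deque in Python. The class and method names (`WorkDeque`, `push`, `pop`, `steal`) are hypothetical and single-threaded; a real scheduler would use a concurrent deque, and this is not the implementation used by any of the runtimes named above.

```python
from collections import deque
from typing import Optional


class WorkDeque:
    """Hypothetical single-threaded model of one worker's task deque."""

    def __init__(self) -> None:
        self._tasks = deque()

    def push(self, task: str) -> None:
        # The owning worker adds new work at the local end.
        self._tasks.append(task)

    def pop(self) -> Optional[str]:
        # The owning worker takes the most recently pushed task (local end).
        return self._tasks.pop() if self._tasks else None

    def steal(self) -> Optional[str]:
        # A thief takes the oldest task from the opposite (steal) end.
        return self._tasks.popleft() if self._tasks else None


d = WorkDeque()
for name in ("t0", "t1", "t2"):
    d.push(name)

stolen = d.steal()  # oldest task, taken by a thief
local = d.pop()     # newest task, taken by the owner
```

Because the owner and the thieves operate on opposite ends, the owner keeps working on recent tasks while thieves take the oldest outstanding work.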
Tracing Algorithms
Rather than trace individual tasks, we exploit the structure of work stealing schedulers to coarsen the events traced. In particular, we construct a steal tree: a tree of steal operations that partitions the program execution into groups of tasks. We identify the key properties of two scheduling policies that enable the steal tree to be represented compactly.
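One way to picture a steal tree is as a tree with one node per steal, so that the trace grows with the number of steals rather than the number of tasks. The sketch below is an assumption-laden illustration: the names `StealNode`, `record_steal`, `count_nodes`, and `path_from_root` are hypothetical, and reading the figure labels such as `1:0` and `2:1` as a (level, step) pair is a guess, not something the text confirms.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StealNode:
    """Hypothetical node: the group of tasks a worker runs after one steal."""
    worker: int
    level: int = 0   # assumed meaning: deque level at which the steal occurred
    step: int = 0    # assumed meaning: position within that level
    parent: Optional["StealNode"] = None
    children: List["StealNode"] = field(default_factory=list)

    def record_steal(self, thief: int, level: int, step: int) -> "StealNode":
        # A steal from this group creates a child group owned by the thief.
        child = StealNode(worker=thief, level=level, step=step, parent=self)
        self.children.append(child)
        return child


def count_nodes(node: StealNode) -> int:
    # Trace size in nodes: one per steal plus the root.
    return 1 + sum(count_nodes(c) for c in node.children)


def path_from_root(node: StealNode) -> List[int]:
    # Chain of workers through which the work reached this node.
    path = []
    current: Optional[StealNode] = node
    while current is not None:
        path.append(current.worker)
        current = current.parent
    return path[::-1]


root = StealNode(worker=0)
c1 = root.record_steal(thief=1, level=1, step=0)
c2 = c1.record_steal(thief=2, level=2, step=0)
c3 = root.record_steal(thief=3, level=1, step=1)
```

In this toy run, three steals produce a four-node tree regardless of how many tasks each group goes on to execute.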
Primary Contributions
- Identifying the key properties of work-first and help-first schedulers that allow the stealing relationships to be represented compactly.
- Algorithms that exploit these properties to trace and replay async-finish programs by efficiently constructing the steal tree.
- Demonstration of low space overheads and within-variance perturbation of execution in tracing work stealing schedulers.
- Applying the tracing to two distinct contexts: data-race detection and retentive stealing.
Applications
- We show a substantial reduction in the cost of data-race detection using an algorithm that exploits our traces to maintain and traverse the dynamic program structure tree (DPST). In some cases, we observe up to a 70% reduction in tree traversals.
- Using our traces, we are able to reduce the memory requirements for retentive stealing by not explicitly enumerating tasks.
Presentation on Wednesday (June 19th) at 3:30pm in the Seattle 2&3 room
fn() {
  s1;
  async { s5; async w; s6; }
  s2;
  finish { s7; async x; s8; async y; s9; async z; s10; }
  s3;
  async { s11; }
  s4;
}
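To make the difference between the two policies concrete for the program above, the following sketch computes the order in which a single worker would run the statements when no steals occur. It rests on simplifying assumptions: under work-first the worker runs a spawned task immediately and later resumes the continuation, under help-first it pushes the spawned task on its deque and keeps going, and a `finish` (or the end of `fn`) drains outstanding tasks from the local end. The encoding of the program and the function names are hypothetical.

```python
# Hypothetical encoding of fn(): strings are statements, tuples are constructs.
PROGRAM = [
    "s1",
    ("async", ["s5", ("async", ["w"]), "s6"]),
    "s2",
    ("finish", ["s7", ("async", ["x"]), "s8", ("async", ["y"]),
                "s9", ("async", ["z"]), "s10"]),
    "s3",
    ("async", ["s11"]),
    "s4",
]


def work_first_order(body, out=None):
    """Single worker, no steals: spawned tasks run immediately."""
    out = [] if out is None else out
    for item in body:
        if isinstance(item, str):
            out.append(item)
        else:
            # Both async and finish bodies are entered right away.
            work_first_order(item[1], out)
    return out


def _help_first(body, out, dq):
    for item in body:
        if isinstance(item, str):
            out.append(item)
        elif item[0] == "async":
            dq.append(item[1])        # defer the spawned task (local end)
        else:                         # finish
            mark = len(dq)
            _help_first(item[1], out, dq)
            while len(dq) > mark:     # run tasks spawned inside the finish
                _help_first(dq.pop(), out, dq)


def help_first_order(body):
    """Single worker, no steals: spawned tasks are deferred to the deque."""
    out, dq = [], []
    _help_first(body, out, dq)
    while dq:                         # implicit finish at the end of fn()
        _help_first(dq.pop(), out, dq)
    return out


wf = work_first_order(PROGRAM)
hf = help_first_order(PROGRAM)
```

Under these assumptions the work-first order matches a sequential run of the program, while the help-first order runs the continuation first and the deferred tasks later, which is why the two policies leave different work available at the steal end of the deque.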
[Figure: three snapshots of the example program fn() during execution, each showing the statements executed so far, the deque (levels 1 to 3, with its steal end and local end), and the resulting steal tree.]