Sound and Precise Analysis ofParallel Programs through
Schedule Specialization
Jingyue Wu, Yang Tang, Gang Hu, Heming Cui, Junfeng YangColumbia University
1
2
Motivation
soundness (# of analyzed schedules / # of total schedules)
precision Total Schedules
AnalyzedSchedules
StaticAnalysis
DynamicAnalysis
AnalyzedSchedules
?
• Analyzing parallel programs is difficult.
3
• Precision: Analyze the program over a small set of schedules. • Soundness: Enforce these schedules at runtime.
Schedule Specialization
soundness (# of analyzed schedules / # of total schedules)
precision Total Schedules
StaticAnalysis
DynamicAnalysis
AnalyzedSchedules
EnforcedSchedules
ScheduleSpecialization
4
Enforcing Schedules Using Peregrine
• Deterministic multithreading– e.g. DMP (ASPLOS ’09), Kendo (ASPLOS ’09), CoreDet
(ASPLOS ’10), Tern (OSDI ’10), Peregrine (SOSP ’11), DTHREADS (SOSP ’11)
– Performance overhead• e.g. Kendo: 16%, Tern & Peregrine: 39.1%
• Peregrine– Record schedules, and reuse them on a wide range of
inputs.– Represent schedules explicitly.
5
• Precision: Analyze the program over a small set of schedules. • Soundness: Enforce these schedules at runtime.
Schedule Specialization
soundness (# of analyzed schedules / # of total schedules)
precision
StaticAnalysis
DynamicAnalysis
AnalyzedSchedules
EnforcedSchedulesSchedule
Specialization
6
Framework
• Extract control flow and data flow enforced by a set of schedules
Schedule
ScheduleSpecializationProgram
C/C++ programwith Pthread
Total order ofsynchronizations
SpecializedProgram
Extra def-usechains
7
Outline
• Example• Control-Flow Specialization• Data-Flow Specialization• Results• Conclusion
Running Example
int results[p_max];int global_id = 0;
int main(int argc, char *argv[]) { int i; int p = atoi(argv[1]); for (i = 0; i < p; ++i) pthread_create(&child[i], 0, worker, 0); for (i = 0; i < p; ++i) pthread_join(child[i], 0); return 0;}
void *worker(void *arg) { pthread_mutex_lock(&global_id_lock); int my_id = global_id++; pthread_mutex_unlock(&global_id_lock); results[my_id] = compute(my_id); return 0;}
8
Thread 0 Thread 1 Thread 2
create
create
join
join
lock
unlocklock
unlock
Race-free?
Control-Flow Specializationint main(int argc, char *argv[]) { int i; int p = atoi(argv[1]); for (i = 0; i < p; ++i) pthread_create(&child[i], 0, worker, 0); for (i = 0; i < p; ++i) pthread_join(child[i], 0); return 0;}
9
create
create
join
join
atoi
++i
create
return
i = 0
i < p
++i
join
i < p
i = 0
Control-Flow Specializationint main(int argc, char *argv[]) { int i; int p = atoi(argv[1]); for (i = 0; i < p; ++i) pthread_create(&child[i], 0, worker, 0); for (i = 0; i < p; ++i) pthread_join(child[i], 0); return 0;}
10
create
create
join
join
atoi
++i
create
return
i = 0
i < p
++i
join
i < p
i = 0
atoi
create
i = 0
i < p
Control-Flow Specializationint main(int argc, char *argv[]) { int i; int p = atoi(argv[1]); for (i = 0; i < p; ++i) pthread_create(&child[i], 0, worker, 0); for (i = 0; i < p; ++i) pthread_join(child[i], 0); return 0;}
11
create
create
join
join
atoi
++i
create
return
i = 0
i < p
++i
join
i < p
i = 0
create
atoi
i = 0
i < p
create
++i
create
i < p
Control-Flow Specializationint main(int argc, char *argv[]) { int i; int p = atoi(argv[1]); for (i = 0; i < p; ++i) pthread_create(&child[i], 0, worker, 0); for (i = 0; i < p; ++i) pthread_join(child[i], 0); return 0;}
12
create
create
join
join
atoi
++i
create
return
i = 0
i < p
++i
join
i < p
i = 0
atoi
create
i = 0
i < p
++i
create
i < p
++i
i < p
join
i < p
i = 0
++i
join
i < p
++i
i < p
return
13
Control-Flow Specialized Program
int main(int argc, char *argv[]) { int i; int p = atoi(argv[1]); i = 0; // i < p == true pthread_create(&child[i], 0, worker.clone1, 0); ++i; // i < p == true pthread_create(&child[i], 0, worker.clone2, 0); ++i; // i < p == false i = 0; // i < p == true pthread_join(child[i], 0); ++i; // i < p == true pthread_join(child[i], 0); ++i; // i < p == false return 0;}
atoi
create
i = 0
i < p
++i
create
i < p
++i
i < p
join
i < p
i = 0
++i
join
i < p
++i
i < p
return
14
More Challenges onControl-Flow Specialization
• Ambiguity
call
Caller Callee
call
S1
• A schedule has too many synchronizations
ret
S2
Data-Flow Specialization
int global_id = 0;
void *worker.clone1(void *arg) { pthread_mutex_lock(&global_id_lock); int my_id = global_id++; pthread_mutex_unlock(&global_id_lock); results[my_id] = compute(my_id); return 0;}
void *worker.clone2(void *arg) { pthread_mutex_lock(&global_id_lock); int my_id = global_id++; pthread_mutex_unlock(&global_id_lock); results[my_id] = compute(my_id); return 0;}
15
Thread 0 Thread 1 Thread 2
create
create
join
join
lock
unlock
lock
unlock
global_id = 0
my_id = global_idglobal_id++
my_id = global_idglobal_id++
Data-Flow Specialization
int global_id = 0;
void *worker.clone1(void *arg) { pthread_mutex_lock(&global_id_lock); int my_id = global_id++; pthread_mutex_unlock(&global_id_lock); results[my_id] = compute(my_id); return 0;}
void *worker.clone2(void *arg) { pthread_mutex_lock(&global_id_lock); int my_id = global_id++; pthread_mutex_unlock(&global_id_lock); results[my_id] = compute(my_id); return 0;}
16
Thread 0 Thread 1 Thread 2
create
create
join
join
lock
unlock
lock
unlock
global_id = 0
my_id = global_idglobal_id++
my_id = global_idglobal_id++
Data-Flow Specialization
int global_id = 0;
void *worker.clone1(void *arg) { pthread_mutex_lock(&global_id_lock); int my_id = global_id++; pthread_mutex_unlock(&global_id_lock); results[my_id] = compute(my_id); return 0;}
void *worker.clone2(void *arg) { pthread_mutex_lock(&global_id_lock); int my_id = global_id++; pthread_mutex_unlock(&global_id_lock); results[my_id] = compute(my_id); return 0;}
17
Thread 0 Thread 1 Thread 2
create
create
join
join
lock
unlock
lock
unlock
global_id = 0
my_id = 0global_id = 1
my_id = global_idglobal_id++
Data-Flow Specialization
int global_id = 0;
void *worker.clone1(void *arg) { pthread_mutex_lock(&global_id_lock); int my_id = global_id++; pthread_mutex_unlock(&global_id_lock); results[my_id] = compute(my_id); return 0;}
void *worker.clone2(void *arg) { pthread_mutex_lock(&global_id_lock); int my_id = global_id++; pthread_mutex_unlock(&global_id_lock); results[my_id] = compute(my_id); return 0;}
18
Thread 0 Thread 1 Thread 2
create
create
join
join
lock
unlock
lock
unlock
global_id = 0
my_id = 0global_id = 1
my_id = 1global_id = 2
Data-Flow Specialization
int global_id = 0;
void *worker.clone1(void *arg) { pthread_mutex_lock(&global_id_lock); global_id = 1; pthread_mutex_unlock(&global_id_lock); results[0] = compute(0); return 0;}
void *worker.clone2(void *arg) { pthread_mutex_lock(&global_id_lock); global_id = 2; pthread_mutex_unlock(&global_id_lock); results[1] = compute(1); return 0;}
19
Thread 0 Thread 1 Thread 2
create
create
join
join
lock
unlock
lock
unlock
global_id = 0
my_id = 0global_id = 1
my_id = 1global_id = 2
20
More Challenges onData-Flow Specialization
• Must/May alias analysis– global_id
• Reasoning about integers– results[0] = compute(0)– results[1] = compute(1)
• Many def-use chains
21
Evaluation
• Applications– Static race detector– Alias analyzer– Path slicer
• Programs– PBZip2 1.1.5– aget 0.4.1– 8 programs in SPLASH2– 7 programs in PARSEC
22
Program Original Specialized
aget 72 0
PBZip2 125 0
fft 96 0
blackscholes 3 0
swaptions 165 0
streamcluster 4 0
canneal 21 0
bodytrack 4 0
ferret 6 0
raytrace 215 0
cholesky 31 7
radix 53 14
water-spatial 2447 1799
lu-contig 18 18
barnes 370 369
water-nsquared 354 333
ocean 331 292
StaticRaceDetector
# of FalsePositives
23
Program Original Specialized
aget 72 0
PBZip2 125 0
fft 96 0
blackscholes 3 0
swaptions 165 0
streamcluster 4 0
canneal 21 0
bodytrack 4 0
ferret 6 0
raytrace 215 0
cholesky 31 7
radix 53 14
water-spatial 2447 1799
lu-contig 18 18
barnes 370 369
water-nsquared 354 333
ocean 331 292
StaticRaceDetector
# of FalsePositives
24
Program Original Specialized
aget 72 0
PBZip2 125 0
fft 96 0
blackscholes 3 0
swaptions 165 0
streamcluster 4 0
canneal 21 0
bodytrack 4 0
ferret 6 0
raytrace 215 0
cholesky 31 7
radix 53 14
water-spatial 2447 1799
lu-contig 18 18
barnes 370 369
water-nsquared 354 333
ocean 331 292
StaticRaceDetector
# of FalsePositives
25
Program Original Specialized
aget 72 0
PBZip2 125 0
fft 96 0
blackscholes 3 0
swaptions 165 0
streamcluster 4 0
canneal 21 0
bodytrack 4 0
ferret 6 0
raytrace 215 0
cholesky 31 7
radix 53 14
water-spatial 2447 1799
lu-contig 18 18
barnes 370 369
water-nsquared 354 333
ocean 331 292
StaticRaceDetector
# of FalsePositives
26
Static Race Detector: Harmful Races Detected
• 4 in aget• 2 in radix• 1 in fft
27
Precision of Schedule-AwareAlias Analysis
28
Precision of Schedule-AwareAlias Analysis
29
Precision of Schedule-AwareAlias Analysis
30
Conclusion and Future Work
• Designed and implemented schedule specialization framework– Analyzes the program over a small set of schedules– Enforces these schedules at runtime
• Built and evaluated three applications– Easy to use– Precise
• Future work– More applications– Similar specialization ideas on sequential programs
31
Related Work• Program analysis for parallel programs
– Chord (PLDI ’06), RADAR (PLDI ’08), FastTrack (PLDI ’09)• Slicing
– Horgon (PLDI ’90), Bouncer (SOSP ’07), Jhala (PLDI ’05), Weiser (PhD thesis), Zhang (PLDI ’04)
• Deterministic multithreading– DMP (ASPLOS ’09), Kendo (ASPLOS ’09), CoreDet (ASPLOS ’10),
Tern (OSDI ’10), Peregrine (SOSP ’11), DTHREADS (SOSP ’11)• Program specialization
– Consel (POPL ’93), Gluck (ISPL ’95), Jørgensen (POPL ’92), Nirkhe (POPL ’92), Reps (PDSPE ’96)
32
Backup Slides
33
Specialization Time
34
Handling Races
• We do not assume data-race freedom. • We could if our only goal is optimization.
35
Input Coverage
• Use runtime verification for the inputs not covered
• A small set of schedules can cover a wide range of inputs
36