EECS Electrical Engineering and Computer Sciences
BERKELEY PAR LAB (Parallel Computing Laboratory)
Efficient Data Race Detection for Distributed Memory Parallel Programs
CS267 Spring 2012 (3/8) Originally presented at Supercomputing 2011 (Seattle, WA)
Chang-Seo Park and Koushik Sen, University of California, Berkeley
Paul Hargrove and Costin Iancu, Lawrence Berkeley National Laboratory
Current State of Parallel Programming

- Parallelism everywhere!
  - Top supercomputer has 500K+ cores
  - Quad-core standard on desktops / laptops
  - Dual-core smartphones
- Parallelism and concurrency make programming harder
  - Scheduling non-determinism may cause subtle bugs
- But: limited usage of testing and correctness tools
  - We like hero programmers
  - Hero programmers can find bugs (in sequential code)
  - Tools are hard to find and use
Outline

- Introduction
- Example Bug and Motivation
- Efficient Data Race Detection with Active Testing
  - Prediction phase
  - Confirmation phase
- HOWTO: Primer on using UPC-Thrille
- Conclusion, Q&A, and Project Ideas
Example Parallel Program

Simple matrix-vector multiply: c = A × b
Example Parallel Program in UPC

 1: void matvec(shared [N] double A[N][N],
                shared double B[N], shared double C[N]) {
 2:   double sum[N];
 3:   upc_forall(int i = 0; i < N; i++; &C[i]) {
 4:     sum[i] = 0;
 5:     for(int j = 0; j < N; j++)
 6:       sum[i] += A[i][j] * B[j];
 7:   }
 8:   upc_forall(int i = 0; i < N; i++; &C[i]) {
 9:     C[i] = sum[i];
10:   }
11: } // assert (C == A*B)

  C = A × B:

  [ ? ]   [ 1 1 ]   [ 1 ]
  [ ? ] = [ 1 1 ] × [ 1 ]
Example Parallel Program in UPC

With A all ones and B = (1, 1), the program computes the expected result C = (2, 2):

  [ 2 ]   [ 1 1 ]   [ 1 ]
  [ 2 ] = [ 1 1 ] × [ 1 ]
UPC Example: Problem?

No apparent bug in this program.
UPC Example: Data Race

No apparent bug in this program. But what if we call matvec(A,B,B)? Data race! With C aliased to B, the writes at line 9 conflict with the reads of B at line 6:

  [ 1 ]   [ 1 1 ]   [ 1 ]
  [ 1 ] = [ 1 1 ] × [ 1 ]
UPC Example: Data Race

One racy schedule: thread 0 finishes its row and writes B[0] = 2 before thread 1 has used the original value:

  [ 2 ]   [ 1 1 ]   [ 1 ]
  [ 1 ] = [ 1 1 ] × [ 1 ]
UPC Example: Data Race

If thread 1 reads the already-overwritten B[0], the final result is wrong:

  [ 2 ]   [ 1 1 ]   [ 1 ]
  [ 3 ] = [ 1 1 ] × [ 1 ]
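The wrong result B = (2, 3) can be reproduced without any real concurrency by replaying the racy interleaving sequentially. A minimal plain-C sketch (not UPC; the function name and fixed schedule are illustrative only):

```c
#include <assert.h>

/* Sequential replay of the racy interleaving: "thread 0" computes its
   row and writes B[0] back before "thread 1" reads the original B[0].
   With matvec(A, B, B) the output aliases the input, so thread 1 then
   consumes the already-updated value. */
#define N 2

static void racy_schedule(double A[N][N], double B[N]) {
    double sum[N] = {0, 0};
    /* thread 0 finishes its row first ... */
    for (int j = 0; j < N; j++)
        sum[0] += A[0][j] * B[j];
    B[0] = sum[0];                 /* ... and writes back early */
    /* thread 1 now reads the overwritten B[0] */
    for (int j = 0; j < N; j++)
        sum[1] += A[1][j] * B[j];
    B[1] = sum[1];
}
```

With the all-ones A and B = (1, 1) from the slides, this schedule leaves B = (2, 3) instead of the correct (2, 2).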
UPC Example: Trace

Example trace:
  4: sum[0] = 0;
  4: sum[1] = 0;
  6: sum[0] += A[0][0]*B[0];
  6: sum[1] += A[1][0]*B[0];
  6: sum[0] += A[0][1]*B[1];
  6: sum[1] += A[1][1]*B[1];
  9: B[0] = sum[0];
  9: B[1] = sum[1];

Data race? (In this interleaving every read of B happens before either write, so the run looks correct.)
UPC Example: Trace

Would be nice to have a trace exhibiting the data race:
  4: sum[0] = 0;
  4: sum[1] = 0;
  6: sum[0] += A[0][0]*B[0];
  6: sum[0] += A[0][1]*B[1];
  6: sum[1] += A[1][0]*B[0];
  9: B[0] = sum[0];
  6: sum[1] += A[1][1]*B[1];
  9: B[1] = sum[1];

Data race!
UPC Example: Trace

Would be nice to have a trace exhibiting the assertion failure:
  4: sum[0] = 0;
  4: sum[1] = 0;
  6: sum[0] += A[0][0]*B[0];
  6: sum[0] += A[0][1]*B[1];
  9: B[0] = sum[0];
  6: sum[1] += A[1][0]*B[0];
  6: sum[1] += A[1][1]*B[1];
  9: B[1] = sum[1];

Data race! C != A*B
Desiderata

Would be nice to have a trace:
- Showing a data race (or some other concurrency bug)
- Showing an assertion violation due to a data race (or some other visible artifact)
Active Testing

Would be nice to have a trace:
- Showing a data race (or some other concurrency bug)
- Showing an assertion violation due to a data race (or some other visible artifact)

Leverage program analysis to make testing quickly find real concurrency bugs:
- Phase 1: Use imprecise static or dynamic program analysis to find bug patterns where a potential concurrency bug can happen (race detector)
- Phase 2: Directed testing to confirm potential bugs as real (race tester)
Active Testing: Phase 1

1. Insert instrumentation at compile time
Active Testing: Phase 1

2. Run the instrumented program normally and obtain a trace:
     4: sum[0] = 0;
     4: sum[1] = 0;
     6: sum[0] += A[0][0]*B[0];
     6: sum[1] += A[1][0]*B[0];
     6: sum[0] += A[0][1]*B[1];
     6: sum[1] += A[1][1]*B[1];
     9: B[0] = sum[0];
     9: B[1] = sum[1];
Active Testing: Phase 1

3. Algorithm detects data races: potential race between statements 6 and 9
Active Testing: Phase 2

Goal 1: Confirm races
Goal 2: Create assertion failure
Active Testing: Phase 2

Generate this execution:
  4: sum[0] = 0;
  4: sum[1] = 0;
  6: sum[0] += A[0][0]*B[0];
  6: sum[0] += A[0][1]*B[1];
  9: B[0] = sum[0];
  6: sum[1] += A[1][0]*B[0];
  6: sum[1] += A[1][1]*B[1];
  9: B[1] = sum[1];

Data race!
Active Testing: Phase 2

Control the scheduler, knowing that (6,9) could race.

Trace:
  4: sum[1] = 0;
Active Testing: Phase 2

Control the scheduler, knowing that (6,9) could race.

Trace:
  4: sum[1] = 0;
  4: sum[0] = 0;
Active Testing: Phase 2

Control the scheduler, knowing that (6,9) could race.

Trace:
  4: sum[1] = 0;
  4: sum[0] = 0;
  6: sum[1] += A[1][0]*B[0];
Active Testing: Phase 2

Control the scheduler, knowing that (6,9) could race.

Trace:
  4: sum[1] = 0;
  4: sum[0] = 0;

Postponed: { 6: sum[1] += A[1][0]*B[0]; }

(Do not postpone if postponing would cause a deadlock.)
Active Testing: Phase 2

Control the scheduler, knowing that (6,9) could race.

Trace:
  4: sum[1] = 0;
  4: sum[0] = 0;
  6: sum[0] += A[0][0]*B[0];

Postponed: { 6: sum[1] += A[1][0]*B[0]; }
Active Testing: Phase 2

Control the scheduler, knowing that (6,9) could race.

Trace:
  4: sum[1] = 0;
  4: sum[0] = 0;
  6: sum[0] += A[0][0]*B[0];
  6: sum[0] += A[0][1]*B[1];

Postponed: { 6: sum[1] += A[1][0]*B[0]; }
Active Testing: Phase 2

Control the scheduler, knowing that (6,9) could race.

Trace:
  4: sum[1] = 0;
  4: sum[0] = 0;
  6: sum[0] += A[0][0]*B[0];
  6: sum[0] += A[0][1]*B[1];
  9: B[0] = sum[0];

Postponed: { 6: sum[1] += A[1][0]*B[0]; }

Race? Yes.
Active Testing: Phase 2

Control the scheduler, knowing that (6,9) could race.

Trace:
  4: sum[1] = 0;
  4: sum[0] = 0;
  6: sum[0] += A[0][0]*B[0];
  6: sum[0] += A[0][1]*B[1];
  9: B[0] = sum[0];
  6: sum[1] += A[1][0]*B[0];

Postponed: {}
Active Testing: Phase 2

Achieved Goal 1: confirmed the race. The racing statements are temporally adjacent:

Trace:
  4: sum[1] = 0;
  4: sum[0] = 0;
  6: sum[0] += A[0][0]*B[0];
  6: sum[0] += A[0][1]*B[1];
  9: B[0] = sum[0];
  6: sum[1] += A[1][0]*B[0];
Active Testing: Phase 2

Achieved Goal 2: assertion failure (C != A*B):

Trace:
  4: sum[1] = 0;
  4: sum[0] = 0;
  6: sum[0] += A[0][0]*B[0];
  6: sum[0] += A[0][1]*B[1];
  9: B[0] = sum[0];
  6: sum[1] += A[1][0]*B[0];
  6: sum[1] += A[1][1]*B[1];
  9: B[1] = sum[1];
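The postponement strategy walked through above can be modeled as a tiny deterministic scheduler. This toy C sketch (an illustration only, not UPC-Thrille's real scheduler) replays two threads that each execute statements 4, 6, 6, 9, parking the first thread that reaches one side of the racy pair (6,9) until another thread reaches the other side:

```c
#include <assert.h>

enum { DONE = 0 };

/* Fills trace[] with executed statement numbers and returns its
   length; *confirmed is set when the racing pair (6,9) executes
   back to back. */
static int run_schedule(int trace[], int *confirmed) {
    int prog[2][5] = {{4, 6, 6, 9, DONE},   /* thread 0 */
                      {4, 6, 6, 9, DONE}};  /* thread 1 */
    int pc[2] = {0, 0};
    int postponed = -1;                     /* parked thread, or -1 */
    int n = 0;
    *confirmed = 0;
    /* simple alternating schedule, thread 1 first as on the slides */
    for (int i = 0; i < 12; i++) {
        int t = (i + 1) % 2;
        if (t == postponed || prog[t][pc[t]] == DONE) continue;
        int s = prog[t][pc[t]];
        if (!*confirmed && postponed < 0 && s == 6) {
            postponed = t;                  /* park it, do not execute */
            continue;
        }
        trace[n++] = s;                     /* execute the statement */
        pc[t]++;
        if (postponed >= 0 && s == 9) {     /* other side of (6,9) */
            *confirmed = 1;                 /* race confirmed: run the
                                               parked access right away */
            trace[n++] = prog[postponed][pc[postponed]++];
            postponed = -1;
        }
    }
    return n;
}
```

The resulting trace is 4, 4, 6, 6, 9, 6, 6, 9, with the racing 9 and 6 temporally adjacent as on the slides. A real implementation must also give up on postponing rather than deadlock, per the note above.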
UPC-Thrille

Thread Interposition Library and Lightweight Extensions: a framework for active testing of UPC programs.
- Instruments UPC source code at compile time: using macro expansions, adds hooks for analyses
- Phase 1 (race detector): observes an execution and predicts which accesses may potentially have a data race
- Phase 2 (race tester): re-executes the program while controlling the scheduler to create the actual data race scenarios predicted in phase 1
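The compile-time hook insertion can be illustrated with an ordinary C macro (purely illustrative; these are not UPC-Thrille's real macros or hook names):

```c
#include <assert.h>

/* Idea sketch: at compile time a shared read such as B[j] is rewritten
   to pass through a recording hook, so the runtime can log the access
   (address, thread, lockset, ...) for later conflict checking. */
static int recorded_reads = 0;

static double record_read(const void *addr, double value) {
    (void)addr;              /* a real tool would log the access here */
    recorded_reads++;
    return value;            /* hand the value back unchanged */
}

/* the instrumented form of a read of arr[i] */
#define READ(arr, i) record_read(&(arr)[i], (arr)[i])

/* e.g., a statement like line 6 of matvec would read B via the hook */
static double sum_two(double B[2]) {
    return READ(B, 0) + READ(B, 1);
}
```

Because the rewrite happens by macro expansion, the source file itself does not change, matching the "no changes needed to source file(s)" point below.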
UPC-Thrille

Extension to the Berkeley UPC compiler and runtime:
- Unfortunately, disabled by default on NERSC clusters
- Fortunately, compilers with Thrille enabled are available:
    /global/homes/p/parkcs/hopper/bin/thrille_upcc
    /global/homes/p/parkcs/franklin/bin/thrille_upcc
- You can also build Berkeley upcc with Thrille enabled from source by following the steps at http://upc.lbl.gov/thrille

Add the switch "-thrille=[mode]" to a (Thrille-enabled) upcc, where [mode] is one of:
- (empty): default, no instrumentation
- racer: phase 1 of race detection, predicts racy statement pairs
- tester: phase 2, tries to create a race on a given statement pair
You can also add "default_options = -thrille=[mode]" to ~/.upccrc
UPC-Thrille

- No changes needed to source file(s)
- Separate binary for each analysis phase, including one "empty" uninstrumented version
- Run each phase with the corresponding binary:

    upcc hello.upc                   ->  a.out   (uninstrumented)
    upcc -thrille=racer hello.upc    ->  b.out   (phase 1)
    upcc -thrille=tester hello.upc   ->  c.out   (phase 2)
    upcrun <binary>
UPC-Thrille: racer

-thrille=racer
- Finds potential data race pairs
- Records them in upct.race.N files (N = 1, 2, 3, ...)
- Only works with a static number of threads for now (needs -T n); this limitation will be lifted soon

Example:
  $ upcc -T4 -thrille=racer matvec.upc -o matvec-racer
  $ upcrun matvec-racer        (in an interactive batch job)
  ...
  [2] Potential race #1 found:
  [2] Read from [0x3ff7004,0x3ff7008) by thread 2 at phase 4 (matvec.upc:17)
  [3] Write to [0x3ff7004,0x3ff7008) by thread 3 at phase 4 (matvec.upc:26)
  ...
UPC-Thrille: tester

-thrille=tester
- Confirms data races predicted in phase 1
- Reads in the upct.race.N files (N = 1, 2, 3, ...) and tests each race individually
- A script, upctrun, is provided to automatically test all races and skip equivalent ones
- One can also test a specific race with the environment variable UPCT_RACE_ID=n

Example:
  $ upcc -T4 -thrille=tester matvec.upc -o matvec-tester
  $ upctrun matvec-tester
  ...
  ('matvec.upc:17', 'matvec.upc:26') : (8, 1, True)
  ...

In the output tuple, 8 is the number of equivalent races, 1 is the number of pairs tested, and True means the race was confirmed.
Limitations

Limitations of the prediction phase:
- Dynamic analysis can only analyze collected data
- Cannot predict races in code that was not executed
- Cannot predict races in binary-only libraries whose source was not instrumented

Limitations of the confirmation phase:
- Non-confirmation does not guarantee race freedom
- "Benign" data races
Conclusion

Active testing for finding bugs in parallel programs:
- Combines dynamic analysis with testing
- Observes executions for potential concurrency bugs
- Re-executes to confirm bugs

UPC-Thrille is an efficient, scalable, and extensible analysis framework for UPC:
- Currently provides race detection analysis
- Other analyses in progress (class projects?)

http://upc.lbl.gov/thrille
[email protected]
Optimization 1: Distributed Checking

- Minimize interaction between threads
- Store shared memory accesses locally
- At a barrier boundary, send access information to the respective owner of the memory
- Conflict checking is distributed among threads
[Diagrams: timelines for threads T1 and T2 showing a shared access after a wait, a shared access after a notify, and a shared access between barriers; the notify/wait and barrier boundaries determine which accesses may happen in parallel]
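The "send to the respective owner" step can be sketched in plain C: each thread buckets its locally recorded accesses by the owning thread of the target address before the barrier exchange. The block size, thread count, and cyclic layout below are assumptions for illustration, not UPC-Thrille's actual layout:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative cyclic block ownership: consecutive BLOCK-sized chunks
   of the shared address space rotate through the threads, so each
   thread checks conflicts only for the addresses it owns. */
#define THREADS 4
#define BLOCK   1024u

static int owner_of(uintptr_t addr) {
    return (int)((addr / BLOCK) % THREADS);
}
```

At a barrier, every thread sends each recorded access to owner_of(address), distributing the conflict-checking work evenly instead of centralizing it.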
Optimization 2: Filter Redundancy

Information collected up to a synchronization point may be redundant:
- Reading and writing the same memory address
- Accessing the same memory in different sizes or with different locksets

(Extended) weaker-than relation:
- Only keep the least protected accesses
- Prune provably redundant accesses [Choi et al. '02]
- Also reduces superfluous race reports

e1 ⊑ e2 (access e1 is weaker than e2) iff:
- larger memory range (e1.m ⊇ e2.m)
- accessed by more threads (e1.t = * ∨ e1.t = e2.t)
- smaller lockset (e1.L ⊆ e2.L)
- weaker access type (e1.a = Write ∨ e1.a = e2.a)
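The weaker-than check transcribes almost directly into C. A sketch, with illustrative field names and locksets modeled as bitmasks (not UPC-Thrille's real structures):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ANY_THREAD (-1)   /* stands for t = "*" in the relation above */

/* One recorded access (m, t, L, a). */
typedef struct {
    uintptr_t lo, hi;     /* memory range m as [lo, hi) */
    int       thread;     /* t, or ANY_THREAD */
    unsigned  lockset;    /* lockset L as a bitmask */
    bool      is_write;   /* access type a */
} access_t;

/* e1 ⊑ e2: e1 covers at least the races e2 could participate in,
   so e2 can be pruned (after Choi et al. '02). */
static bool weaker_than(access_t e1, access_t e2) {
    return e1.lo <= e2.lo && e1.hi >= e2.hi              /* m1 ⊇ m2 */
        && (e1.thread == ANY_THREAD || e1.thread == e2.thread)
        && (e1.lockset & ~e2.lockset) == 0               /* L1 ⊆ L2 */
        && (e1.is_write || e1.is_write == e2.is_write);  /* weaker a */
}
```

For example, a lock-free write covering a whole range is weaker than a locked read of a sub-range by the same thread, so only the write needs to be kept.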
Optimization 3: Sampling

Scientific applications have tight loops:
- Same computation and communication pattern each iteration
- Inefficient to check for races at every loop iteration

Reduce overhead by sampling:
- Probabilistically sample each instrumentation point
- Reduce the probability at each unsuccessful check
- Set the probability to 0 when a race is found (disable the check)
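The backoff policy above can be sketched in a few lines of C; the 16-bit fixed-point scale and the halving factor are assumptions for illustration:

```c
#include <assert.h>

/* Each instrumentation point carries a sampling probability.  It
   starts fully sampled, halves after every unsuccessful check, and
   is disabled entirely once a race has been found there. */
#define FULL (1u << 16)

typedef struct { unsigned prob; } probe_t;   /* prob out of FULL */

static int should_check(const probe_t *p, unsigned rand16) {
    return rand16 < p->prob;   /* rand16 uniform in [0, FULL) */
}

static void after_check(probe_t *p, int race_found) {
    if (race_found)
        p->prob = 0;           /* disable: race already reported */
    else
        p->prob /= 2;          /* back off after an unsuccessful check */
}
```

In a tight loop with a stable communication pattern, this makes the per-iteration instrumentation cost decay geometrically after the first few iterations.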
How Well Does It Scale?

Maximum 8% slowdown at 8K cores:
- Franklin Cray XT4 supercomputer at NERSC
- Quad-core 2.3GHz CPU and 8GB RAM per node
- Portals interconnect

Optimizations for scalability:
- Efficient data structures
- Minimized communication
- Sampling with exponential backoff
Active Testing Cartoon: Phase I
Potential Collision
Active Testing Cartoon: Phase II
New Landscape for HPC

Shared memory for scalability and utilization:
- Hybrid programming models: MPI + OpenMP
- PGAS: UPC, CAF, X10, etc.
- Asynchronous access to shared data is likely to cause bugs

Unified Parallel C (UPC):
- Parallel extensions to the ISO C99 standard for shared- and distributed-memory hardware
- Single Program Multiple Data (SPMD) + Partitioned Global Address Space (PGAS)
- Shared memory concurrency: transparent access using pointers to shared data (arrays)
- Bulk transfers with memcpy, memput, memget
- Fine-grained (lock) and bulk (barrier) synchronization
Phase 1: Checking for Conflicts

To predict possible races:
- Need to check all shared accesses for conflicts
- Collect information through instrumentation

Two accesses e1 = (m1, t1, L1, a1, p1, s1) and e2 = (m2, t2, L2, a2, p2, s2) are in conflict when:
- memory ranges overlap (m1 ∩ m2 ≠ ∅)
- the accesses come from different threads (t1 ≠ t2)
- no common locks are held (L1 ∩ L2 = ∅)
- at least one is a write (a1 = Write ∨ a2 = Write)
- they may happen in parallel w.r.t. barriers (p1 || p2)

⟹ (s1, s2) is a potential data race pair
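The conflict conditions transcribe almost directly into C. A sketch, with locksets modeled as bitmasks and the may-happen-in-parallel phase check passed in as a flag (illustrative modeling assumptions, not UPC-Thrille's real data structures):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One shared access following the tuple (m, t, L, a, p, s) above. */
typedef struct {
    uintptr_t lo, hi;     /* memory range m as [lo, hi) */
    int       thread;     /* thread t */
    unsigned  lockset;    /* lockset L as a bitmask */
    bool      is_write;   /* access type a */
} access_t;

/* Direct transcription of the five conflict conditions. */
static bool conflicts(access_t e1, access_t e2, bool may_happen_par) {
    return e1.lo < e2.hi && e2.lo < e1.hi       /* m1 ∩ m2 ≠ ∅ */
        && e1.thread != e2.thread               /* t1 ≠ t2      */
        && (e1.lockset & e2.lockset) == 0       /* L1 ∩ L2 = ∅  */
        && (e1.is_write || e2.is_write)         /* one write    */
        && may_happen_par;                      /* p1 || p2     */
}
```

As a usage example, a read and a write of the same address range by two different threads with no locks held, in the same barrier phase, form a potential race pair; adding a common lock removes the conflict.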
Differences and Challenges for UPC

Previous work targeted Java and pthreads programs:
- Synchronization with locks and condition variables
- Single node

UPC has a different programming model (SPMD):
- Large scale
- Bulk communication (memory regions)
- Non-blocking communication
- Collective operations with data movement
- Different memory consistency model

Optimizations for scalability:
- Distribute the analysis and coalesce queries
- Efficient data structures for memory-interval reasoning
- Reduce communication through filtering and sampling
Results on Single Node

(4 threads on a quad-core 2.66GHz CPU / 8GB RAM)

Benchmark   LoC    Runtime   Racer overhead   Pot. races   Tester overhead   Conf. races
guppie       227    2.094s   12%              2            1.7%              2
knapsack     191    2.099s   14.9%            2            1.8%              2
laplace      123    2.101s   16.3%            0            -                 -
mcop         358    2.183s   0.7%             0            -                 -
psearch      777    2.982s   1.8%             3            3.8%              2
FT 2.3      2306    8.711s   6.1%             2            4.8%              2
CG 2.4      1939    3.812s   0.5%             0            -                 -
EP 2.4       763    10.02s   0.9%             0            -                 -
FT 2.4      2374    7.036s   0.1%             1            4.2%              1
IS 2.4      1449    3.073s   1.1%             0            -                 -
MG 2.4      2314    4.895s   3.1%             2            1.2%              2
BT 3.3      9626    48.78s   0.5%             8            0.8%              0
LU 3.3      6311    37.05s   0.5%             0            -                 -
SP 3.3      5691    59.56s   0.2%             8            3.0%              0

Low overhead: < 20%
Unconfirmed bugs are due to custom synchronization
Scalability Results on Franklin*

[Chart: speedup (log scale, 10 to 1000) vs. core count (16 to 576+) for NPB benchmarks BT, LU, MG, and SP, Class C and Class D, comparing Normal, Racer, and Tester runs]

Maximum 8% slowdown at 8K cores

* Cray XT4 supercomputer at NERSC: quad-core 2.3GHz CPU / 8GB RAM per node / Portals interconnect
Bugs Found

In NPB 2.3 FT, the wrong lock allocation function causes real races in the validation code, producing spurious validation-failure errors. (upc_global_lock_alloc is non-collective: when every thread calls it, each thread gets a distinct lock, so the collective upc_all_lock_alloc should have been used.)

  shared dcomplex *dbg_sum;
  static upc_lock_t *sum_write;

  sum_write = upc_global_lock_alloc(); // wrong function

  upc_lock(sum_write);
  {
    dbg_sum->real = dbg_sum->real + chk.real;
    dbg_sum->imag = dbg_sum->imag + chk.imag;
  }
  upc_unlock(sum_write);
Bugs Found

In SPLASH2 lu, multiple initialization of a vector without locks:
- Different results on different executions
- Performance bug

  void InitA()
  {
    …
    for (j=0; j<n; j++) {
      for (i=0; i<n; i++) {
        rhs[i] += a[i+j*n]; // executed by all threads
      }
    }
  }
Conclusion

Need correctness tool support for HPC:
- Scarcity of effective correctness tools

Our proposal: active testing
- Combines dynamic analysis with testing
- Low overhead (< 10%)
- Scalable (> 8K cores)
- General algorithm: applicable to other programming models (MPI, CUDA, OpenMP)

http://upc.lbl.gov/thrille
PGAS @ Booth 124