EECS Electrical Engineering and Computer Sciences
BERKELEY PAR LAB (Parallel Computing Laboratory)
Efficient Data Race Detection for Distributed Memory Parallel Programs
CS267 Spring 2012 (3/8) Originally presented at Supercomputing 2011 (Seattle, WA)
Chang-Seo Park and Koushik Sen, University of California, Berkeley
Paul Hargrove and Costin Iancu, Lawrence Berkeley National Laboratory
Current State of Parallel Programming

- Parallelism everywhere!
  - Top supercomputer has 500K+ cores
  - Quad-core standard on desktops / laptops
  - Dual-core smartphones
- Parallelism and concurrency make programming harder
  - Scheduling non-determinism may cause subtle bugs
- But: limited usage of testing and correctness tools
  - We like hero programmers
  - Hero programmers can find bugs (in sequential code)
  - Tools are hard to find and use
Outline

- Introduction
- Example Bug and Motivation
- Efficient Data Race Detection with Active Testing
  - Prediction phase
  - Confirmation phase
- HOWTO: Primer on using UPC-Thrille
- Conclusion, Q&A, and Project Ideas
Example Parallel Program

Simple matrix-vector multiply: c = A × b
Example Parallel Program in UPC

 1: void matvec(shared [N] double A[N][N],
                shared double B[N], shared double C[N]) {
 2:   double sum[N];
 3:   upc_forall(int i = 0; i < N; i++; &C[i]) {
 4:     sum[i] = 0;
 5:     for(int j = 0; j < N; j++)
 6:       sum[i] += A[i][j] * B[j];
 7:   }
 8:   upc_forall(int i = 0; i < N; i++; &C[i]) {
 9:     C[i] = sum[i];
10:   }
11: } // assert (C == A*B)

  C = A × B:

  [ ? ]   [ 1 1 ]   [ 1 ]
  [ ? ] = [ 1 1 ] × [ 1 ]
Example Parallel Program in UPC

With A all ones and B = (1, 1), the program computes the expected result C = (2, 2):

  [ 2 ]   [ 1 1 ]   [ 1 ]
  [ 2 ] = [ 1 1 ] × [ 1 ]
UPC Example: Problem?

No apparent bug in this program.
UPC Example: Data Race

No apparent bug in this program. But what if we call matvec(A,B,B)? Data race! With C aliased to B, the writes at line 9 conflict with the reads of B at line 6:

  [ 1 ]   [ 1 1 ]   [ 1 ]
  [ 1 ] = [ 1 1 ] × [ 1 ]
UPC Example: Data Race

One racy schedule: thread 0 finishes its row and writes B[0] = 2 before thread 1 has used the original value:

  [ 2 ]   [ 1 1 ]   [ 1 ]
  [ 1 ] = [ 1 1 ] × [ 1 ]
UPC Example: Data Race

If thread 1 reads the already-overwritten B[0], the final result is wrong:

  [ 2 ]   [ 1 1 ]   [ 1 ]
  [ 3 ] = [ 1 1 ] × [ 1 ]
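The wrong result B = (2, 3) can be reproduced without any real concurrency by replaying the racy interleaving sequentially. A minimal plain-C sketch (not UPC; the function name and fixed schedule are illustrative only):

```c
#include <assert.h>

/* Sequential replay of the racy interleaving: "thread 0" computes its
   row and writes B[0] back before "thread 1" reads the original B[0].
   With matvec(A, B, B) the output aliases the input, so thread 1 then
   consumes the already-updated value. */
#define N 2

static void racy_schedule(double A[N][N], double B[N]) {
    double sum[N] = {0, 0};
    /* thread 0 finishes its row first ... */
    for (int j = 0; j < N; j++)
        sum[0] += A[0][j] * B[j];
    B[0] = sum[0];                 /* ... and writes back early */
    /* thread 1 now reads the overwritten B[0] */
    for (int j = 0; j < N; j++)
        sum[1] += A[1][j] * B[j];
    B[1] = sum[1];
}
```

With the all-ones A and B = (1, 1) from the slides, this schedule leaves B = (2, 3) instead of the correct (2, 2).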
UPC Example: Trace

Example trace:
  4: sum[0] = 0;
  4: sum[1] = 0;
  6: sum[0] += A[0][0]*B[0];
  6: sum[1] += A[1][0]*B[0];
  6: sum[0] += A[0][1]*B[1];
  6: sum[1] += A[1][1]*B[1];
  9: B[0] = sum[0];
  9: B[1] = sum[1];

Data race? (In this interleaving every read of B happens before either write, so the run looks correct.)
UPC Example: Trace

Would be nice to have a trace exhibiting the data race:
  4: sum[0] = 0;
  4: sum[1] = 0;
  6: sum[0] += A[0][0]*B[0];
  6: sum[0] += A[0][1]*B[1];
  6: sum[1] += A[1][0]*B[0];
  9: B[0] = sum[0];
  6: sum[1] += A[1][1]*B[1];
  9: B[1] = sum[1];

Data race!
UPC Example: Trace

Would be nice to have a trace exhibiting the assertion failure:
  4: sum[0] = 0;
  4: sum[1] = 0;
  6: sum[0] += A[0][0]*B[0];
  6: sum[0] += A[0][1]*B[1];
  9: B[0] = sum[0];
  6: sum[1] += A[1][0]*B[0];
  6: sum[1] += A[1][1]*B[1];
  9: B[1] = sum[1];

Data race! C != A*B
Desiderata

Would be nice to have a trace:
- Showing a data race (or some other concurrency bug)
- Showing an assertion violation due to a data race (or some other visible artifact)
Active Testing

Would be nice to have a trace:
- Showing a data race (or some other concurrency bug)
- Showing an assertion violation due to a data race (or some other visible artifact)

Leverage program analysis to make testing quickly find real concurrency bugs:
- Phase 1: Use imprecise static or dynamic program analysis to find bug patterns where a potential concurrency bug can happen (race detector)
- Phase 2: Directed testing to confirm potential bugs as real (race tester)
Active Testing: Phase 1

1. Insert instrumentation at compile time
Active Testing: Phase 1

2. Run the instrumented program normally and obtain a trace:
     4: sum[0] = 0;
     4: sum[1] = 0;
     6: sum[0] += A[0][0]*B[0];
     6: sum[1] += A[1][0]*B[0];
     6: sum[0] += A[0][1]*B[1];
     6: sum[1] += A[1][1]*B[1];
     9: B[0] = sum[0];
     9: B[1] = sum[1];
Active Testing: Phase 1

3. Algorithm detects data races: potential race between statements 6 and 9
Active Testing: Phase 2

Goal 1: Confirm races
Goal 2: Create assertion failure
Active Testing: Phase 2

Generate this execution:
  4: sum[0] = 0;
  4: sum[1] = 0;
  6: sum[0] += A[0][0]*B[0];
  6: sum[0] += A[0][1]*B[1];
  9: B[0] = sum[0];
  6: sum[1] += A[1][0]*B[0];
  6: sum[1] += A[1][1]*B[1];
  9: B[1] = sum[1];

Data race!
Active Testing: Phase 2

Control the scheduler, knowing that (6,9) could race.

Trace:
  4: sum[1] = 0;
Active Testing: Phase 2

Control the scheduler, knowing that (6,9) could race.

Trace:
  4: sum[1] = 0;
  4: sum[0] = 0;
Active Testing: Phase 2

Control the scheduler, knowing that (6,9) could race.

Trace:
  4: sum[1] = 0;
  4: sum[0] = 0;
  6: sum[1] += A[1][0]*B[0];
Active Testing: Phase 2

Control the scheduler, knowing that (6,9) could race.

Trace:
  4: sum[1] = 0;
  4: sum[0] = 0;

Postponed: { 6: sum[1] += A[1][0]*B[0]; }

(Do not postpone if postponing would cause a deadlock.)
Active Testing: Phase 2

Control the scheduler, knowing that (6,9) could race.

Trace:
  4: sum[1] = 0;
  4: sum[0] = 0;
  6: sum[0] += A[0][0]*B[0];

Postponed: { 6: sum[1] += A[1][0]*B[0]; }
Active Testing: Phase 2

Control the scheduler, knowing that (6,9) could race.

Trace:
  4: sum[1] = 0;
  4: sum[0] = 0;
  6: sum[0] += A[0][0]*B[0];
  6: sum[0] += A[0][1]*B[1];

Postponed: { 6: sum[1] += A[1][0]*B[0]; }
Active Testing: Phase 2

Control the scheduler, knowing that (6,9) could race.

Trace:
  4: sum[1] = 0;
  4: sum[0] = 0;
  6: sum[0] += A[0][0]*B[0];
  6: sum[0] += A[0][1]*B[1];
  9: B[0] = sum[0];

Postponed: { 6: sum[1] += A[1][0]*B[0]; }

Race? Yes.
Active Testing: Phase 2

Control the scheduler, knowing that (6,9) could race.

Trace:
  4: sum[1] = 0;
  4: sum[0] = 0;
  6: sum[0] += A[0][0]*B[0];
  6: sum[0] += A[0][1]*B[1];
  9: B[0] = sum[0];
  6: sum[1] += A[1][0]*B[0];

Postponed: {}
Active Testing: Phase 2

Achieved Goal 1: confirmed the race. The racing statements are temporally adjacent:

Trace:
  4: sum[1] = 0;
  4: sum[0] = 0;
  6: sum[0] += A[0][0]*B[0];
  6: sum[0] += A[0][1]*B[1];
  9: B[0] = sum[0];
  6: sum[1] += A[1][0]*B[0];
Active Testing: Phase 2

Achieved Goal 2: assertion failure (C != A*B):

Trace:
  4: sum[1] = 0;
  4: sum[0] = 0;
  6: sum[0] += A[0][0]*B[0];
  6: sum[0] += A[0][1]*B[1];
  9: B[0] = sum[0];
  6: sum[1] += A[1][0]*B[0];
  6: sum[1] += A[1][1]*B[1];
  9: B[1] = sum[1];
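The postponement strategy walked through above can be modeled as a tiny deterministic scheduler. This toy C sketch (an illustration only, not UPC-Thrille's real scheduler) replays two threads that each execute statements 4, 6, 6, 9, parking the first thread that reaches one side of the racy pair (6,9) until another thread reaches the other side:

```c
#include <assert.h>

enum { DONE = 0 };

/* Fills trace[] with executed statement numbers and returns its
   length; *confirmed is set when the racing pair (6,9) executes
   back to back. */
static int run_schedule(int trace[], int *confirmed) {
    int prog[2][5] = {{4, 6, 6, 9, DONE},   /* thread 0 */
                      {4, 6, 6, 9, DONE}};  /* thread 1 */
    int pc[2] = {0, 0};
    int postponed = -1;                     /* parked thread, or -1 */
    int n = 0;
    *confirmed = 0;
    /* simple alternating schedule, thread 1 first as on the slides */
    for (int i = 0; i < 12; i++) {
        int t = (i + 1) % 2;
        if (t == postponed || prog[t][pc[t]] == DONE) continue;
        int s = prog[t][pc[t]];
        if (!*confirmed && postponed < 0 && s == 6) {
            postponed = t;                  /* park it, do not execute */
            continue;
        }
        trace[n++] = s;                     /* execute the statement */
        pc[t]++;
        if (postponed >= 0 && s == 9) {     /* other side of (6,9) */
            *confirmed = 1;                 /* race confirmed: run the
                                               parked access right away */
            trace[n++] = prog[postponed][pc[postponed]++];
            postponed = -1;
        }
    }
    return n;
}
```

The resulting trace is 4, 4, 6, 6, 9, 6, 6, 9, with the racing 9 and 6 temporally adjacent as on the slides. A real implementation must also give up on postponing rather than deadlock, per the note above.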
UPC-Thrille

Thread Interposition Library and Lightweight Extensions: a framework for active testing of UPC programs.
- Instruments UPC source code at compile time: using macro expansions, adds hooks for analyses
- Phase 1 (race detector): observes an execution and predicts which accesses may potentially have a data race
- Phase 2 (race tester): re-executes the program while controlling the scheduler to create the actual data race scenarios predicted in phase 1
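The compile-time hook insertion can be illustrated with an ordinary C macro (purely illustrative; these are not UPC-Thrille's real macros or hook names):

```c
#include <assert.h>

/* Idea sketch: at compile time a shared read such as B[j] is rewritten
   to pass through a recording hook, so the runtime can log the access
   (address, thread, lockset, ...) for later conflict checking. */
static int recorded_reads = 0;

static double record_read(const void *addr, double value) {
    (void)addr;              /* a real tool would log the access here */
    recorded_reads++;
    return value;            /* hand the value back unchanged */
}

/* the instrumented form of a read of arr[i] */
#define READ(arr, i) record_read(&(arr)[i], (arr)[i])

/* e.g., a statement like line 6 of matvec would read B via the hook */
static double sum_two(double B[2]) {
    return READ(B, 0) + READ(B, 1);
}
```

Because the rewrite happens by macro expansion, the source file itself does not change, matching the "no changes needed to source file(s)" point below.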
UPC-Thrille

Extension to the Berkeley UPC compiler and runtime:
- Unfortunately, disabled by default on NERSC clusters
- Fortunately, compilers with Thrille enabled are available:
    /global/homes/p/parkcs/hopper/bin/thrille_upcc
    /global/homes/p/parkcs/franklin/bin/thrille_upcc
- You can also build Berkeley upcc with Thrille enabled from source by following the steps at http://upc.lbl.gov/thrille

Add the switch "-thrille=[mode]" to a (Thrille-enabled) upcc, where [mode] is one of:
- (empty): default, no instrumentation
- racer: phase 1 of race detection, predicts racy statement pairs
- tester: phase 2, tries to create a race on a given statement pair
You can also add "default_options = -thrille=[mode]" to ~/.upccrc
UPC-Thrille

- No changes needed to source file(s)
- Separate binary for each analysis phase, including one "empty" uninstrumented version
- Run each phase with the corresponding binary:

    upcc hello.upc                   ->  a.out   (uninstrumented)
    upcc -thrille=racer hello.upc    ->  b.out   (phase 1)
    upcc -thrille=tester hello.upc   ->  c.out   (phase 2)
    upcrun <binary>
UPC-Thrille: racer

-thrille=racer
- Finds potential data race pairs
- Records them in upct.race.N files (N = 1, 2, 3, ...)
- Only works with a static number of threads for now (needs -T n); this limitation will be lifted soon

Example:
  $ upcc -T4 -thrille=racer matvec.upc -o matvec-racer
  $ upcrun matvec-racer        (in an interactive batch job)
  ...
  [2] Potential race #1 found:
  [2] Read from [0x3ff7004,0x3ff7008) by thread 2 at phase 4 (matvec.upc:17)
  [3] Write to [0x3ff7004,0x3ff7008) by thread 3 at phase 4 (matvec.upc:26)
  ...
UPC-Thrille: tester

-thrille=tester
- Confirms data races predicted in phase 1
- Reads in the upct.race.N files (N = 1, 2, 3, ...) and tests each race individually
- A script, upctrun, is provided to automatically test all races and skip equivalent ones
- One can also test a specific race with the environment variable UPCT_RACE_ID=n

Example:
  $ upcc -T4 -thrille=tester matvec.upc -o matvec-tester
  $ upctrun matvec-tester
  ...
  ('matvec.upc:17', 'matvec.upc:26') : (8, 1, True)
  ...

In the output tuple, 8 is the number of equivalent races, 1 is the number of pairs tested, and True means the race was confirmed.
Limitations

Limitations of the prediction phase:
- Dynamic analysis can only analyze collected data
- Cannot predict races in code that was not executed
- Cannot predict races in binary-only libraries whose source was not instrumented

Limitations of the confirmation phase:
- Non-confirmation does not guarantee race freedom
- "Benign" data races
Conclusion

Active testing for finding bugs in parallel programs:
- Combines dynamic analysis with testing
- Observes executions for potential concurrency bugs
- Re-executes to confirm bugs

UPC-Thrille is an efficient, scalable, and extensible analysis framework for UPC:
- Currently provides race detection analysis
- Other analyses in progress (class projects?)

http://upc.lbl.gov/thrille
[email protected]
Optimization 1: Distributed Checking

- Minimize interaction between threads
- Store shared memory accesses locally
- At a barrier boundary, send access information to the respective owner of the memory
- Conflict checking is distributed among threads
[Diagrams: timelines for threads T1 and T2 showing a shared access after a wait, a shared access after a notify, and a shared access between barriers; the notify/wait and barrier boundaries determine which accesses may happen in parallel]
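The "send to the respective owner" step can be sketched in plain C: each thread buckets its locally recorded accesses by the owning thread of the target address before the barrier exchange. The block size, thread count, and cyclic layout below are assumptions for illustration, not UPC-Thrille's actual layout:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative cyclic block ownership: consecutive BLOCK-sized chunks
   of the shared address space rotate through the threads, so each
   thread checks conflicts only for the addresses it owns. */
#define THREADS 4
#define BLOCK   1024u

static int owner_of(uintptr_t addr) {
    return (int)((addr / BLOCK) % THREADS);
}
```

At a barrier, every thread sends each recorded access to owner_of(address), distributing the conflict-checking work evenly instead of centralizing it.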
Optimization 2: Filter Redundancy

Information collected up to a synchronization point may be redundant:
- Reading and writing the same memory address
- Accessing the same memory in different sizes or with different locksets

(Extended) weaker-than relation:
- Only keep the least protected accesses
- Prune provably redundant accesses [Choi et al. '02]
- Also reduces superfluous race reports

e1 ⊑ e2 (access e1 is weaker than e2) iff:
- larger memory range (e1.m ⊇ e2.m)
- accessed by more threads (e1.t = * ∨ e1.t = e2.t)
- smaller lockset (e1.L ⊆ e2.L)
- weaker access type (e1.a = Write ∨ e1.a = e2.a)
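The weaker-than check transcribes almost directly into C. A sketch, with illustrative field names and locksets modeled as bitmasks (not UPC-Thrille's real structures):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ANY_THREAD (-1)   /* stands for t = "*" in the relation above */

/* One recorded access (m, t, L, a). */
typedef struct {
    uintptr_t lo, hi;     /* memory range m as [lo, hi) */
    int       thread;     /* t, or ANY_THREAD */
    unsigned  lockset;    /* lockset L as a bitmask */
    bool      is_write;   /* access type a */
} access_t;

/* e1 ⊑ e2: e1 covers at least the races e2 could participate in,
   so e2 can be pruned (after Choi et al. '02). */
static bool weaker_than(access_t e1, access_t e2) {
    return e1.lo <= e2.lo && e1.hi >= e2.hi              /* m1 ⊇ m2 */
        && (e1.thread == ANY_THREAD || e1.thread == e2.thread)
        && (e1.lockset & ~e2.lockset) == 0               /* L1 ⊆ L2 */
        && (e1.is_write || e1.is_write == e2.is_write);  /* weaker a */
}
```

For example, a lock-free write covering a whole range is weaker than a locked read of a sub-range by the same thread, so only the write needs to be kept.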
Optimization 3: Sampling

Scientific applications have tight loops:
- Same computation and communication pattern each iteration
- Inefficient to check for races at every loop iteration

Reduce overhead by sampling:
- Probabilistically sample each instrumentation point
- Reduce the probability at each unsuccessful check
- Set the probability to 0 when a race is found (disable the check)
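The backoff policy above can be sketched in a few lines of C; the 16-bit fixed-point scale and the halving factor are assumptions for illustration:

```c
#include <assert.h>

/* Each instrumentation point carries a sampling probability.  It
   starts fully sampled, halves after every unsuccessful check, and
   is disabled entirely once a race has been found there. */
#define FULL (1u << 16)

typedef struct { unsigned prob; } probe_t;   /* prob out of FULL */

static int should_check(const probe_t *p, unsigned rand16) {
    return rand16 < p->prob;   /* rand16 uniform in [0, FULL) */
}

static void after_check(probe_t *p, int race_found) {
    if (race_found)
        p->prob = 0;           /* disable: race already reported */
    else
        p->prob /= 2;          /* back off after an unsuccessful check */
}
```

In a tight loop with a stable communication pattern, this makes the per-iteration instrumentation cost decay geometrically after the first few iterations.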
How Well Does It Scale?

Maximum 8% slowdown at 8K cores:
- Franklin Cray XT4 supercomputer at NERSC
- Quad-core 2.3GHz CPU and 8GB RAM per node
- Portals interconnect

Optimizations for scalability:
- Efficient data structures
- Minimized communication
- Sampling with exponential backoff
Active Testing Cartoon: Phase I
Potential Collision
Active Testing Cartoon: Phase II
New Landscape for HPC

Shared memory for scalability and utilization:
- Hybrid programming models: MPI + OpenMP
- PGAS: UPC, CAF, X10, etc.
- Asynchronous access to shared data is likely to cause bugs

Unified Parallel C (UPC):
- Parallel extensions to the ISO C99 standard for shared- and distributed-memory hardware
- Single Program Multiple Data (SPMD) + Partitioned Global Address Space (PGAS)
- Shared memory concurrency: transparent access using pointers to shared data (arrays)
- Bulk transfers with memcpy, memput, memget
- Fine-grained (lock) and bulk (barrier) synchronization
Phase 1: Checking for Conflicts

To predict possible races:
- Need to check all shared accesses for conflicts
- Collect information through instrumentation

Two accesses e1 = (m1, t1, L1, a1, p1, s1) and e2 = (m2, t2, L2, a2, p2, s2) are in conflict when:
- memory ranges overlap (m1 ∩ m2 ≠ ∅)
- the accesses come from different threads (t1 ≠ t2)
- no common locks are held (L1 ∩ L2 = ∅)
- at least one is a write (a1 = Write ∨ a2 = Write)
- they may happen in parallel w.r.t. barriers (p1 || p2)

⟹ (s1, s2) is a potential data race pair
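The conflict conditions transcribe almost directly into C. A sketch, with locksets modeled as bitmasks and the may-happen-in-parallel phase check passed in as a flag (illustrative modeling assumptions, not UPC-Thrille's real data structures):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One shared access following the tuple (m, t, L, a, p, s) above. */
typedef struct {
    uintptr_t lo, hi;     /* memory range m as [lo, hi) */
    int       thread;     /* thread t */
    unsigned  lockset;    /* lockset L as a bitmask */
    bool      is_write;   /* access type a */
} access_t;

/* Direct transcription of the five conflict conditions. */
static bool conflicts(access_t e1, access_t e2, bool may_happen_par) {
    return e1.lo < e2.hi && e2.lo < e1.hi       /* m1 ∩ m2 ≠ ∅ */
        && e1.thread != e2.thread               /* t1 ≠ t2      */
        && (e1.lockset & e2.lockset) == 0       /* L1 ∩ L2 = ∅  */
        && (e1.is_write || e2.is_write)         /* one write    */
        && may_happen_par;                      /* p1 || p2     */
}
```

As a usage example, a read and a write of the same address range by two different threads with no locks held, in the same barrier phase, form a potential race pair; adding a common lock removes the conflict.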
Differences and Challenges for UPC

Previous work targeted Java and pthreads programs:
- Synchronization with locks and condition variables
- Single node

UPC has a different programming model (SPMD):
- Large scale
- Bulk communication (memory regions)
- Non-blocking communication
- Collective operations with data movement
- Different memory consistency model

Optimizations for scalability:
- Distribute the analysis and coalesce queries
- Efficient data structures for memory-interval reasoning
- Reduce communication through filtering and sampling
Results on Single Node

(4 threads on a quad-core 2.66GHz CPU / 8GB RAM)

Benchmark   LoC    Runtime   Racer overhead   Pot. races   Tester overhead   Conf. races
guppie       227    2.094s   12%              2            1.7%              2
knapsack     191    2.099s   14.9%            2            1.8%              2
laplace      123    2.101s   16.3%            0            -                 -
mcop         358    2.183s   0.7%             0            -                 -
psearch      777    2.982s   1.8%             3            3.8%              2
FT 2.3      2306    8.711s   6.1%             2            4.8%              2
CG 2.4      1939    3.812s   0.5%             0            -                 -
EP 2.4       763    10.02s   0.9%             0            -                 -
FT 2.4      2374    7.036s   0.1%             1            4.2%              1
IS 2.4      1449    3.073s   1.1%             0            -                 -
MG 2.4      2314    4.895s   3.1%             2            1.2%              2
BT 3.3      9626    48.78s   0.5%             8            0.8%              0
LU 3.3      6311    37.05s   0.5%             0            -                 -
SP 3.3      5691    59.56s   0.2%             8            3.0%              0

Low overhead: < 20%
Unconfirmed bugs are due to custom synchronization
Scalability Results on Franklin*

[Chart: speedup (log scale, 10 to 1000) vs. core count (16 to 576+) for NPB benchmarks BT, LU, MG, and SP, Class C and Class D, comparing Normal, Racer, and Tester runs]

Maximum 8% slowdown at 8K cores

* Cray XT4 supercomputer at NERSC: quad-core 2.3GHz CPU / 8GB RAM per node / Portals interconnect
Bugs Found

In NPB 2.3 FT, the wrong lock allocation function causes real races in the validation code, producing spurious validation-failure errors. (upc_global_lock_alloc is non-collective: when every thread calls it, each thread gets a distinct lock, so the collective upc_all_lock_alloc should have been used.)

  shared dcomplex *dbg_sum;
  static upc_lock_t *sum_write;

  sum_write = upc_global_lock_alloc(); // wrong function

  upc_lock(sum_write);
  {
    dbg_sum->real = dbg_sum->real + chk.real;
    dbg_sum->imag = dbg_sum->imag + chk.imag;
  }
  upc_unlock(sum_write);
Bugs Found

In SPLASH2 lu, multiple initialization of a vector without locks:
- Different results on different executions
- Performance bug

  void InitA()
  {
    …
    for (j=0; j<n; j++) {
      for (i=0; i<n; i++) {
        rhs[i] += a[i+j*n]; // executed by all threads
      }
    }
  }
Conclusion

Need correctness tool support for HPC:
- Scarcity of effective correctness tools

Our proposal: active testing
- Combines dynamic analysis with testing
- Low overhead (< 10%)
- Scalable (> 8K cores)
- General algorithm: applicable to other programming models (MPI, CUDA, OpenMP)

http://upc.lbl.gov/thrille
PGAS @ Booth 124