An efficient data race detector for DIOTA
Michiel Ronsse, Bastiaan Stougie, Jonas Maebe, Frank Cornelis, Koen De Bosschere
Department of Electronics and Information Systems, Ghent University, Belgium
Computer Engineering Lab, Delft University of Technology, The Netherlands
Parco2003, September 2-5, Dresden

Contents
• Introduction
• Non-determinism & data races
• DIOTA
• On-the-fly data race detection using DIOTA
  • Method
  • Implementation
• Data Race Detection Example
• Experimental Evaluation
• Conclusions

Introduction
Developing parallel programs for multiprocessors with shared memory is considered difficult:
• number of threads running simultaneously
• co-operation & synchronisation through shared memory
Data races occur when:
• two threads access the same shared variable (memory location) in an unsynchronised way, and
• at least one thread modifies the variable
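
Example code
#include <pthread.h>
#include <stdio.h>

unsigned global = 5;

/* Both threads update `global` without synchronisation: a data race. */
void *thread2(void *arg) { global = global + 6; return NULL; }
void *thread3(void *arg) { global = global + 7; return NULL; }

int main(void) {
    pthread_t t2, t3;
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_create(&t3, NULL, thread3, NULL);
    pthread_join(t2, NULL);
    pthread_join(t3, NULL);
    printf("global=%u\n", global);
    return 0;
}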
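
Possible executions
[Figure: three interleavings of the loads (L) and stores (S) of the two threads]
• global=18: one thread executes L(5), +6, S(11) before the other executes L(11), +7, S(18)
• global=12: both threads execute L(5); S(11) is overwritten by S(12)
• global=11: both threads execute L(5); S(12) is overwritten by S(11)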
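The three outcomes can be reproduced without real threads by replaying the four memory operations in a fixed order. The following is a minimal sketch, not part of the original slides: the names `simulate`, `LOAD_A`, `STORE_A`, `LOAD_B` and `STORE_B` are hypothetical, and each thread is assumed to keep the loaded value in a private register between its load and its store.

```c
#include <stddef.h>

/* The four memory operations of the two threads in the example. */
enum step { LOAD_A, STORE_A, LOAD_B, STORE_B };

/* Replays one interleaving and returns the final value of `global`.
 * Thread A adds 6 and thread B adds 7; each works on a private copy
 * (its "register") between its load and its store. */
unsigned simulate(const enum step *order, size_t n)
{
    unsigned global = 5, reg_a = 0, reg_b = 0;
    for (size_t i = 0; i < n; i++) {
        switch (order[i]) {
        case LOAD_A:  reg_a = global;     break;
        case STORE_A: global = reg_a + 6; break;
        case LOAD_B:  reg_b = global;     break;
        case STORE_B: global = reg_b + 7; break;
        }
    }
    return global;
}
```

Replaying LOAD_A, STORE_A, LOAD_B, STORE_B yields 18, while the orders in which both loads come first yield 12 or 11, depending on which store is last.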
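
Example code II
#include <pthread.h>
#include <stdio.h>

unsigned global = 5;

/* The slide leaves lock()/unlock() undefined; a single pthread mutex is assumed here. */
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static void lock(void)   { pthread_mutex_lock(&mutex); }
static void unlock(void) { pthread_mutex_unlock(&mutex); }

/* The lock serialises the two updates, so the result is always 18. */
void *thread2(void *arg) { lock(); global = global + 6; unlock(); return NULL; }
void *thread3(void *arg) { lock(); global = global + 7; unlock(); return NULL; }

int main(void) {
    pthread_t t2, t3;
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_create(&t3, NULL, thread3, NULL);
    pthread_join(t2, NULL);
    pthread_join(t3, NULL);
    printf("global=%u\n", global);
    return 0;
}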

Detecting Data Races
Automatic data race detection is possible:
• collect all memory references
• check parallel references
Static methods: checking the source code for all possible executions with all possible input values
• NP-complete, not feasible
Dynamic methods: detect data races during one particular execution
• post mortem (not feasible)
• on-the-fly

Dynamic data race detection
A segment is the piece of code between two consecutive synchronisation operations.
We collect two sets for all segments a of all threads: L(a) and S(a) with the addresses of all load and store operations
For all parallel segments a and b,
$$(L(a) \cap S(b)) \cup (S(a) \cap L(b)) \cup (S(a) \cap S(b))$$
gives the list of conflicting addresses.
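The conflict expression maps directly onto bitwise operations when the address sets are stored as bitmaps. A minimal sketch, assuming a toy address space of 64 addresses held in one `uint64_t` per set; the names `segment_sets` and `conflicts` are hypothetical and not taken from the tool.

```c
#include <stdint.h>

/* Toy address sets: bit i set means "address i was accessed". */
typedef struct {
    uint64_t loads;   /* L(a) */
    uint64_t stores;  /* S(a) */
} segment_sets;

/* Addresses on which two parallel segments conflict:
 * load/store, store/load or store/store on the same address. */
uint64_t conflicts(segment_sets a, segment_sets b)
{
    return (a.loads & b.stores) | (a.stores & b.loads) | (a.stores & b.stores);
}
```

Two segments that only load the same address produce an empty result, which matches the definition of a data race: at least one of the accesses must be a store.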

Logical Clocks
A logical clock C( ) attaches a timestamp C(a) to an event a
Used for tracing the causal order of events
Clock condition:
$$a \rightarrow b \Rightarrow C(a) < C(b)$$
Clocks are strongly consistent if
$$a \rightarrow b \Leftrightarrow C(a) < C(b)$$
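
Scalar Clocks
Lamport Clocks
Simple and fast update algorithm:
$$SC(b) = \max\{SC(a_i) : a_i \rightarrow b\} + 1$$
Provides only limited information:
$$SC(a) < SC(b) \Rightarrow a \rightarrow b \text{ or } a \parallel b$$
$$SC(a) = SC(b) \Rightarrow a \parallel b$$
$$SC(a) > SC(b) \Rightarrow b \rightarrow a \text{ or } a \parallel b$$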
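The update rule and the limited conclusions it allows can be sketched as follows. This is an illustration only: the names `sc_next`, `sc_compare` and the `sc_order` values are hypothetical, an event is assumed to have at most two direct predecessors (the previous event on its own thread and one event on another thread), and the comparison assumes two distinct events.

```c
/* Timestamp of a new event whose direct predecessors carry the
 * timestamps `local` (previous event on the same thread) and
 * `remote` (event on another thread that it synchronises with). */
unsigned sc_next(unsigned local, unsigned remote)
{
    unsigned max = local > remote ? local : remote;
    return max + 1;
}

/* What two scalar timestamps reveal about two distinct events a and b. */
enum sc_order {
    SC_BEFORE_OR_PARALLEL,  /* a -> b or a // b */
    SC_PARALLEL,            /* a // b */
    SC_AFTER_OR_PARALLEL    /* b -> a or a // b */
};

enum sc_order sc_compare(unsigned sc_a, unsigned sc_b)
{
    if (sc_a < sc_b) return SC_BEFORE_OR_PARALLEL;
    if (sc_a > sc_b) return SC_AFTER_OR_PARALLEL;
    return SC_PARALLEL;
}
```

Only equal timestamps give a definite answer, which is why scalar clocks alone cannot decide whether two segments are parallel.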

Scalar Clocks: example
[Figure: example execution with events labelled by scalar (Lamport) timestamps]

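Vector Clocks
A vector clock for a program using N processes consists of N scalar values
$$VC(b) = \max\{VC(a_i) : a_i \rightarrow b\} + (0,\ldots,0,1,0,\ldots,0)$$
Such a clock is strongly consistent
$$VC(a) < VC(b) \Leftrightarrow a \rightarrow b$$
$$VC(a) > VC(b) \Leftrightarrow b \rightarrow a$$
$$\text{otherwise: } a \parallel b$$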
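A minimal sketch of the vector clock update and comparison, assuming a fixed number of threads `N`, a component-wise maximum, and the unit vector having its 1 at the position of the thread that executes the new event. The names `vclock`, `vc_next`, `vc_less` and `vc_parallel` are hypothetical.

```c
#include <stdbool.h>

#define N 3  /* number of threads; fixed for this sketch */

typedef struct { unsigned v[N]; } vclock;

/* Component-wise maximum of the two predecessor timestamps,
 * plus 1 in the component of the thread `self` executing the event. */
vclock vc_next(vclock local, vclock remote, int self)
{
    vclock r;
    for (int i = 0; i < N; i++)
        r.v[i] = local.v[i] > remote.v[i] ? local.v[i] : remote.v[i];
    r.v[self]++;
    return r;
}

/* VC(a) < VC(b): no component larger, at least one strictly smaller. */
bool vc_less(vclock a, vclock b)
{
    bool strict = false;
    for (int i = 0; i < N; i++) {
        if (a.v[i] > b.v[i]) return false;
        if (a.v[i] < b.v[i]) strict = true;
    }
    return strict;
}

/* Two distinct events are parallel when neither timestamp is smaller. */
bool vc_parallel(vclock a, vclock b)
{
    return !vc_less(a, b) && !vc_less(b, a);
}
```

For instance, a thread at (3,7,5) that synchronises with an event stamped (10,2,4) moves to (10,8,5) if it is the second thread; an event stamped (11,2,4) is then parallel with it, because each timestamp is larger in one component.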

Vector Clocks: example
[Figure: example execution with events labelled by vector timestamps: (10,2,4), (2,4,6), (3,7,5), (11,2,4), (10,8,5), (12,9,5), (10,9,5), (10,8,7), (10,10,5)]

DIOTA
DIOTA (Dynamic Instrumentation, Optimization and Transformation of Applications) is a generic instrumentation tool
Backends use DIOTA to
• instrument memory
• intercept synchronisation functions
• …
Deals correctly with data in code, code in data, self-modifying code
Clones processes: the original process is used for the data and the instrumented clone is used for the code
No need for recompilation, relinking or instrumentation of files.

Execution replay
ROLT (Reconstruction of Lamport Timestamps) is used for tracing/replaying the synchronisation operations
Attaches a scalar Lamport timestamp to each synchronisation operation
Delaying each synchronisation operation until the operations with a smaller timestamp have been executed suffices for a correct replay
We only need to log a small subset of all operations

Collecting memory operations
We need two lists of addresses per segment a: L(a) and S(a)
A multilevel bitmap is used:
• takes spatial locality into account
• low memory consumption
• comparing two bitmaps is easy
We lose information: two accesses to the same variable are counted once. This is however no problem for data race detection.

Multilevel Memory bitmap
[Figure: an address split into fields of 9 bit, 9 bit and 14 bit that index the levels of the bitmap S(a)]
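A three-level bitmap with a 9/9/14 bit split can be sketched as below. This is an illustration under stated assumptions, not the tool's data structure: addresses are taken to be 32 bits wide, the two 9-bit fields are assumed to be the upper bits and to index two levels of pointer tables, the low 14 bits are assumed to select one bit per byte address in a leaf, and all names (`addr_bitmap`, `bitmap_set`, `bitmap_test`) are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define L1_BITS   9
#define L2_BITS   9
#define LEAF_BITS 14

typedef struct { uint8_t bits[(1u << LEAF_BITS) / 8]; } leaf;  /* 2 KiB */
typedef struct { leaf *leaves[1u << L2_BITS]; } mid_table;
typedef struct { mid_table *mids[1u << L1_BITS]; } addr_bitmap;

/* Marks `addr` as accessed; lower levels are allocated on first use.
 * Returns false if an allocation fails. */
bool bitmap_set(addr_bitmap *bm, uint32_t addr)
{
    uint32_t i1 = addr >> (L2_BITS + LEAF_BITS);
    uint32_t i2 = (addr >> LEAF_BITS) & ((1u << L2_BITS) - 1);
    uint32_t i3 = addr & ((1u << LEAF_BITS) - 1);

    if (!bm->mids[i1] && !(bm->mids[i1] = calloc(1, sizeof(mid_table))))
        return false;
    mid_table *m = bm->mids[i1];
    if (!m->leaves[i2] && !(m->leaves[i2] = calloc(1, sizeof(leaf))))
        return false;
    m->leaves[i2]->bits[i3 / 8] |= (uint8_t)(1u << (i3 % 8));
    return true;
}

/* Reports whether `addr` was marked; untouched regions have no tables. */
bool bitmap_test(const addr_bitmap *bm, uint32_t addr)
{
    uint32_t i1 = addr >> (L2_BITS + LEAF_BITS);
    uint32_t i2 = (addr >> LEAF_BITS) & ((1u << L2_BITS) - 1);
    uint32_t i3 = addr & ((1u << LEAF_BITS) - 1);

    const mid_table *m = bm->mids[i1];
    if (!m || !m->leaves[i2])
        return false;
    return (m->leaves[i2]->bits[i3 / 8] >> (i3 % 8)) & 1u;
}
```

Because tables exist only for regions that were actually touched, nearby accesses share one leaf, and two bitmaps can be compared by skipping every index where either side has a null pointer.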

Detecting parallel segments
A vector timestamp is attached to each segment.
All segment information (two bitmaps+vector timestamps) is kept on a list L.
Each new segment is compared against the segments on list L.
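Comparing a new segment against list L can be sketched as a loop that skips ordered segments and intersects the address sets of parallel ones. The sketch assumes three threads and toy 64-bit bitmaps instead of multilevel bitmaps; `segment`, `ordered_before` and `check_segment` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NTHREADS 3  /* fixed for this sketch */

typedef struct {
    unsigned vc[NTHREADS];   /* vector timestamp of the segment */
    uint64_t loads, stores;  /* toy bitmaps: bit i = address i */
} segment;

/* True when a's timestamp is smaller than b's, i.e. a happened before b. */
static bool ordered_before(const segment *a, const segment *b)
{
    bool strict = false;
    for (int i = 0; i < NTHREADS; i++) {
        if (a->vc[i] > b->vc[i]) return false;
        if (a->vc[i] < b->vc[i]) strict = true;
    }
    return strict;
}

/* Compares `fresh` against every segment on the list and returns the
 * union of the conflicting addresses found in parallel segments. */
uint64_t check_segment(const segment *list, size_t n, const segment *fresh)
{
    uint64_t races = 0;
    for (size_t i = 0; i < n; i++) {
        const segment *s = &list[i];
        if (ordered_before(s, fresh) || ordered_before(fresh, s))
            continue;  /* ordered segments cannot race */
        races |= (s->loads & fresh->stores)
               | (s->stores & fresh->loads)
               | (s->stores & fresh->stores);
    }
    return races;
}
```

A non-zero result identifies the addresses involved in a race between the new segment and at least one segment already on the list.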

Detecting obsolete segments
Obsolete segments should be removed from list L as soon as possible.
An obsolete segment is a segment that can no longer be parallel with new segments.
We use a snooped matrix clock in order to detect these segments.
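The obsolescence test can be sketched with a matrix whose row t holds the latest vector timestamp known for thread t. This is a simplified illustration under assumptions that the slides do not state: all rows are assumed to be readable at once, and a later segment n is assumed to be ordered after a finished segment s of thread `owner` exactly when the `owner` component of n's timestamp is at least that of s. The name `is_obsolete` is hypothetical.

```c
#include <stdbool.h>

#define NTHREADS 3  /* fixed for this sketch */

/* `seg_vc` is the timestamp of a finished segment of thread `owner`;
 * `current[t]` is the latest timestamp known for thread t.
 * The segment is obsolete once every other thread has already seen it,
 * because thread clocks never decrease. */
bool is_obsolete(const unsigned seg_vc[NTHREADS], int owner,
                 unsigned current[NTHREADS][NTHREADS])
{
    for (int t = 0; t < NTHREADS; t++) {
        if (t == owner)
            continue;  /* later segments of the owner follow in program order */
        if (current[t][owner] < seg_vc[owner])
            return false;  /* thread t can still start a parallel segment */
    }
    return true;
}
```

With a segment stamped (10,2,4) on the first thread, a third thread still at (2,4,6) keeps the segment alive; once that thread reaches (10,8,7) the segment can be dropped from list L.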

Detecting obsolete segments
[Figure: execution diagram showing the segments on list L, the segments in execution, the point of execution and the future]

Detecting obsolete segments
[Figure: execution diagram showing the segments on list L, the obsolete segments, the segments in execution, the point of execution and the future]

Comparing parallel segments
[Figure: execution diagram showing the segments on list L, the obsolete segments, the segments in execution, the point of execution and the future]
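
Overview
[Figure: flowchart of the debugging cycle with the steps Choose input, Record, Replay+detect, Replay+ident., Replay+debug, Choose new input and The end. The legend distinguishes automatic steps from steps that require user intervention; arrows labelled "race" mark the paths taken when a race is found]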

Experimental Evaluation
Implementation for Linux running on Intel multiprocessors.
Tested on a dual 500MHz Celeron PC.
SPLASH-2 was used as a benchmark: a number of multithreaded numeric applications, such as a fast Fourier transform, a raytracer, ...
Several data races were found, including in SPLASH-2.

Performance of RecPlay
Slowdown:
program                          normal exec.  DIOTA: no instrument.  DIOTA: memory instrum.  DIOTA: data race detection
mozilla                          7,50          35,00 (4,67x)          169,00 (22,53x)         401,00 (53,47x)
LU.cont -p4                      8,06          9,59 (1,19x)           54,15 (6,72x)           85,74 (10,64x)
fft -p4 -m22                     11,47         27,59 (2,41x)          200,37 (17,47x)         393,36 (34,29x)
radix -p4 -n4194304              6,96          11,74 (1,69x)          137,39 (19,74x)         244,18 (35,08x)
cholesky -p4 inputs/tk29.o       10,43         12,84 (1,23x)          310,74 (29,79x)         581,97 (55,80x)
ocean -p4 -n514                  15,59         17,56 (1,13x)          339,06 (21,75x)         667,14 (42,79x)
radiosity -p 4 -batch -room      27,50         90,14 (3,28x)          1157,45 (42,09x)        6805,61 (247,48x)
water-spatial < input4           30,70         52,51 (1,71x)          742,27 (24,18x)         1566,04 (51,01x)
Memory consumption: <3.4x

Conclusions
DIOTA is a practical and efficient tool for detecting and removing data races.
Three types of clocks (scalar, vector and matrix) are used to enable a fast and memory-efficient implementation.
Data races have been found.