Inspect, ISP, and FIB: reduction-based verification and analysis tools
for concurrent programs
Research Group: Yu Yang, Xiaofang Chen, Sarvani Vakkalanka, Subodh Sharma, Anh Vo, Michael DeLisi, Geof Sawaya
Faculty: Ganesh Gopalakrishnan (speaker) and Robert M. Kirby
School of Computing, University of Utah, Salt Lake City, UT
[email protected] http://www.cs.utah.edu/formal_verification
Talk at MSR India, Bangalore Research Labs, June 6, 2008
Supported by Microsoft HPC Center Grant, NSF CNS-0509379, SRC TJ 1318
2
Multicores are the future! Need to employ / teach concurrent programming at an unprecedented scale!
(photo courtesy of Intel Corporation)
Some of today’s proposals:
Threads (various)
Message Passing (various)
Transactional Memory (various)
OpenMP
MPI
Intel’s Ct
Microsoft’s Parallel Fx
Cilk Arts’s Cilk
Intel’s TBB
Nvidia’s Cuda
…
4
Q: What tool feature is desired for any concurrency approach?
Threads
Message Passing
Transactional Memory
OpenMP
MPI
Ct
Parallel Fx
Cilk
TBB
Cuda
…
A: The ability to verify over all RELEVANT interleavings!
Will have different “grain” sizes (notions of atomicity)
Different types of interactions between “threads / processes”
Different kinds of bugs
-- deadlocks
-- data races
-- communication races
-- memory leaks
Yet, the basic goal of verification remains the same: achieving the effect of having examined all possible interleavings while exploring only representative interleavings
5
An exponential number of interleavings… We need sound criteria for reductions (e.g., partial order reduction, POR).
[Figure: Card Deck 0 and Card Deck 1, each holding cards 0–5]
• Suppose only the interleavings of the red cards matter
• Then don’t try all 12! / (6! · 6!) = 924 riffle-shuffles
• Just do TWO shuffles !!
6
The Growth of (n·p)! / (n!)^p

[Figure: p threads (Thread 1 … Thread p), each with steps 1 … n]

• Unity / Murphi “guard / action” rules: n = 1, p = R gives R! interleavings
• p = 3, n = 5 gives about 10^6 interleavings
• p = 3, n = 6 gives about 17 × 10^6 interleavings
• p = 4, n = 5 gives about 10^10 interleavings
The situation is worse, because each statement (card) produces different state transformations…
8
Ad-hoc Testing is INEFFECTIVE for thread verification !
[Figure: p threads (Thread 1 … Thread p), each with steps 1 … n]
Need Sound and Practically Justifiable Reduction Techniques !!
9
Growing Need to Verify Real-world Concurrent Programs
One often discovers what one is doing only through programming!
– Need a safety net!
‘Correct by construction’ methods work only around stable ideas
– Multicore programming is not there yet
Tools are needed for TODAY’s problems
– Inspect is one such tool
10
The need for dynamic verification: Too many realities – e.g. the code eliminated may have the bug
#include <stdlib.h>      // Dining Philosophers with no deadlock
#include <pthread.h>     // all phils but "odd" one pick up their
#include <stdio.h>       // left fork first; odd phil picks
#include <string.h>      // up right fork first
#include <malloc.h>
#include <errno.h>
#include <sys/types.h>
#include <assert.h>

#define NUM_THREADS 3

pthread_mutex_t mutexes[NUM_THREADS];
pthread_cond_t conditionVars[NUM_THREADS];
int permits[NUM_THREADS];
pthread_t tids[NUM_THREADS];

int data = 0;

void * Philosopher(void * arg) {
  int i;
  i = (int)arg;

  // pickup left fork
  pthread_mutex_lock(&mutexes[i%NUM_THREADS]);
  while (permits[i%NUM_THREADS] == 0) {
    printf("P%d : tryget F%d\n", i, i%NUM_THREADS);
    pthread_cond_wait(&conditionVars[i%NUM_THREADS], &mutexes[i%NUM_THREADS]);
  }
11
Threads are too low level – Ed Lee – “The Problem with Threads”
» Complex global effects
» Semantics are not compositional
» Non-robustness against failure
Yet, alternative proposals are in a state of flux
– OpenMP
– Transactional Memory
– …
Will need threads to IMPLEMENT alternate proposals!
Need to support dynamic verification of thread programs
12
Each parallel programming API class / approach seems to warrant its own dynamic verification approach
May be possible to employ common instrumentation / replay mechanisms in the long run
As of now, one has to build customized implementations
– We have implementations for Threads (Inspect) and MPI (ISP)
– We are building an implementation for OpenMP (based on a backtrackable version of OMPi from Greece)
Speaking in general
13
(BlueGene/L - Image courtesy of IBM / LLNL) (Image courtesy of Steve Parker, CSAFE, Utah)
Reason for our interest in the Message Passing Interface
• MPI is the de facto standard for programming clusters
A large API with over 300 functions, widely supported
A custom-made dynamic verification approach is needed
The need for FV solutions is acutely felt in this area…
14
The success of MPI over many apps (Courtesy of Al Geist, EuroPVM / MPI 2007)
15
MPI is complex …
– Send
– Receive
– Send / Receive
– Send / Receive / Replace
– Broadcast
– Barrier
– Reduce
– Rendezvous mode
– Blocking mode
– Non-blocking mode
– Reliance on system buffering
– User-attached buffering
– Restarts/Cancels of MPI Operations
– Non Wildcard receives
– Wildcard receives
– Tag matching
– Communication spaces
An MPI program is an interesting (and legal) combination of elements from these spaces
Yet, the complexity seems unavoidable to succeed at the scale of MPI’s deployment…
We have defined a formal semantics for 150 of MPI's 300+ functions in TLA+ (soon to try other notations, e.g. SAL)
18
Our approach with MPI: go after “low-hanging bugs”

Automated verification of common mistakes:
– Deadlocks
– Communication Races
– Resource Leaks
19
Deadlock pattern…
P0        P1
---       ---
s(P1);    s(P0);
r(P1);    r(P0);
P0        P1
---       ---
Bcast;    Barrier;
Barrier;  Bcast;
20
Communication Race Pattern…
P0        P1        P2
---       ---       ---
r(*);     s(P0);    s(P0);
r(P1);

OK: r(*) matched P2’s send, so r(P1) finds P1’s send

P0        P1        P2
---       ---       ---
r(*);     s(P0);    s(P0);
r(P1);

NOK: r(*) matched P1’s send, so r(P1) never finds a matching send
21
Resource Leak Pattern…
P0
---
some_allocation_op(&handle);
FORGOTTEN DEALLOC !!
22
Why is even this much debugging hard?
The “crooked barrier” quiz will show you why…
P0                  P1                  P2
---                 ---                 ---
MPI_Isend( P2 )     MPI_Barrier         MPI_Irecv( ANY )
MPI_Barrier         MPI_Isend( P2 )     MPI_Barrier
Will P1’s Send Match P2’s Receive ?
23
P0                  P1                  P2
---                 ---                 ---
MPI_Isend( P2 )     MPI_Barrier         MPI_Irecv( ANY )
MPI_Barrier         MPI_Isend( P2 )     MPI_Barrier
It will! A nonblocking MPI_Irecv stays pending across the barrier, so P1’s post-barrier send can still match it.
MPI Behavior
The “crooked barrier” quiz
We need a dynamic verification approach to be aware of the details of the API behavior…
30
Another motivating example: we should not multiply out the interleavings of P0–P2 against those of P3–P5
P0                  P1                  P2
---                 ---                 ---
MPI_Isend( P2 )     MPI_Barrier         MPI_Irecv( * )
MPI_Barrier         MPI_Isend( P2 )     MPI_Barrier

P3                  P4                  P5
---                 ---                 ---
MPI_Isend( P5 )     MPI_Barrier         MPI_Irecv( * )
MPI_Barrier         MPI_Isend( P5 )     MPI_Barrier
31
Results pertaining to Inspect (to be presented at SPIN 2008)
32
Inspect Workflow
Multithreaded C/C++ program
   → instrumentation →
instrumented program (+ thread library wrapper)
   → compile →
executable
   → run: thread 1 … thread n  ↔  scheduler (request / permit)
33
Overview
Scheduler: DPOR engine, state stack, message buffer
   ↕ action request / permission (Unix domain sockets)
Visible operation interceptor
Program under test (threads)
34
Message Types
 Thread creation/termination messages
 Visible operation requests
 – acquire/release locks
 – wait for/send signals
 – read/write shared objects
 Other helper messages
 – local state changes
 – ...
35
Overview of the source transformation
Multithreaded C Program
   → Inter-procedural Flow-sensitive Alias Analysis
   → Thread Escape Analysis
   → Intra-procedural Dataflow Analysis
   → Source code transformation
   → Instrumented Program
36
Source code transformation (1)
Function calls to the thread library routines become calls to the Inspect library wrappers. In detail:

pthread_create      →  inspect_thread_create
pthread_mutex_lock  →  inspect_mutex_lock
…
37
Source code transformation (2)
x = rhs;     becomes     write_shared_xxx(&x, rhs);

void write_shared_xxx(type * addr, type val) {
  inspect_obj_write(addr);
  *addr = val;
}

lhs = x;     becomes     read_shared_xxx(&lhs, &x);

void read_shared_xxx(type * lhs, type * addr) {
  inspect_obj_read(addr);
  *lhs = *addr;
}
38
Source Transformation (3)
thread_routine(…) {
  …
}

becomes

thread_routine(…) {
  inspect_thread_begin();
  …
  inspect_thread_end();
}
39
Source Transformation (4)
visible operation 1
…
visible operation 2

becomes

visible operation 1
…
inspect_local_changes(…)
visible operation 2
40
Result of instrumentation
Before:

void * Philosopher(void * arg) {
  int i;
  i = (int)arg;
  ...
  pthread_mutex_lock(&mutexes[i%3]);
  ...
  while (permits[i%3] == 0) {
    printf("P%d : tryget F%d\n", i, i%3);
    pthread_cond_wait(...);
  }
  ...
  permits[i%3] = 0;
  ...
  pthread_cond_signal(&conditionVars[i%3]);
  pthread_mutex_unlock(&mutexes[i%3]);
  return NULL;
}

After:

void *Philosopher(void *arg ) {
  int i ;
  pthread_mutex_t *tmp ;
  inspect_thread_start("Philosopher");
  i = (int )arg;
  tmp = & mutexes[i % 3];
  …
  inspect_mutex_lock(tmp);
  …
  while (1) {
    __cil_tmp43 = read_shared_0(& permits[i % 3]);
    if (! __cil_tmp32) { break; }
    __cil_tmp33 = i % 3;
    …
    tmp___0 = __cil_tmp33;
    …
    inspect_cond_wait(...);
  }
  ...
  write_shared_1(& permits[i % 3], 0);
  ...
  inspect_cond_signal(tmp___25);
  ...
  inspect_mutex_unlock(tmp___26);
  ...
  inspect_thread_end();
  return (__retres31);
}
41
Philosophers in PThreads…
#include <stdlib.h>      // Dining Philosophers with no deadlock
#include <pthread.h>     // all phils but "odd" one pick up their
#include <stdio.h>       // left fork first; odd phil picks
#include <string.h>      // up right fork first
#include <malloc.h>
#include <errno.h>
#include <sys/types.h>
#include <assert.h>

#define NUM_THREADS 3

pthread_mutex_t mutexes[NUM_THREADS];
pthread_cond_t conditionVars[NUM_THREADS];
int permits[NUM_THREADS];
pthread_t tids[NUM_THREADS];

int data = 0;

void * Philosopher(void * arg) {
  int i;
  i = (int)arg;

  // pickup left fork
  pthread_mutex_lock(&mutexes[i%NUM_THREADS]);
  while (permits[i%NUM_THREADS] == 0) {
    printf("P%d : tryget F%d\n", i, i%NUM_THREADS);
    pthread_cond_wait(&conditionVars[i%NUM_THREADS], &mutexes[i%NUM_THREADS]);
  }
  permits[i%NUM_THREADS] = 0;
  printf("P%d : get F%d\n", i, i%NUM_THREADS);
  pthread_mutex_unlock(&mutexes[i%NUM_THREADS]);

  // pickup right fork
  pthread_mutex_lock(&mutexes[(i+1)%NUM_THREADS]);
  while (permits[(i+1)%NUM_THREADS] == 0) {
    printf("P%d : tryget F%d\n", i, (i+1)%NUM_THREADS);
    pthread_cond_wait(&conditionVars[(i+1)%NUM_THREADS], &mutexes[(i+1)%NUM_THREADS]);
  }
  permits[(i+1)%NUM_THREADS] = 0;
  printf("P%d : get F%d\n", i, (i+1)%NUM_THREADS);
  pthread_mutex_unlock(&mutexes[(i+1)%NUM_THREADS]);

  //printf("philosopher %d thinks \n",i);
  printf("%d\n", i);
  // data = 10 * data + i;
  fflush(stdout);

  // putdown right fork
  pthread_mutex_lock(&mutexes[(i+1)%NUM_THREADS]);
  permits[(i+1)%NUM_THREADS] = 1;
  printf("P%d : put F%d\n", i, (i+1)%NUM_THREADS);
  pthread_cond_signal(&conditionVars[(i+1)%NUM_THREADS]);
  pthread_mutex_unlock(&mutexes[(i+1)%NUM_THREADS]);
42
…Philosophers in PThreads

  // putdown left fork
  pthread_mutex_lock(&mutexes[i%NUM_THREADS]);
  permits[i%NUM_THREADS] = 1;
  printf("P%d : put F%d \n", i, i%NUM_THREADS);
  pthread_cond_signal(&conditionVars[i%NUM_THREADS]);
  pthread_mutex_unlock(&mutexes[i%NUM_THREADS]);

  // putdown right fork
  pthread_mutex_lock(&mutexes[(i+1)%NUM_THREADS]);
  permits[(i+1)%NUM_THREADS] = 1;
  printf("P%d : put F%d \n", i, (i+1)%NUM_THREADS);
  pthread_cond_signal(&conditionVars[(i+1)%NUM_THREADS]);
  pthread_mutex_unlock(&mutexes[(i+1)%NUM_THREADS]);

  return NULL;
}

int main() {
  int i;

  for (i = 0; i < NUM_THREADS; i++)
    pthread_mutex_init(&mutexes[i], NULL);
  for (i = 0; i < NUM_THREADS; i++)
    pthread_cond_init(&conditionVars[i], NULL);
  for (i = 0; i < NUM_THREADS; i++)
    permits[i] = 1;

  for (i = 0; i < NUM_THREADS-1; i++) {
    pthread_create(&tids[i], NULL, Philosopher, (void*)(i));
  }
  pthread_create(&tids[NUM_THREADS-1], NULL, OddPhilosopher, (void*)(NUM_THREADS-1));

  for (i = 0; i < NUM_THREADS; i++) {
    pthread_join(tids[i], NULL);
  }

  for (i = 0; i < NUM_THREADS; i++) {
    pthread_mutex_destroy(&mutexes[i]);
  }
  for (i = 0; i < NUM_THREADS; i++) {
    pthread_cond_destroy(&conditionVars[i]);
  }

  //printf(" data = %d \n", data);
  //assert( data != 201);
  return 0;
}
43
‘Plain run’ of Philosophers
gcc -g -O3 -o nobug examples/Dining3.c -L ./lib -lpthread -lstdc++ -lssl
% time nobug

P0 : get F0
P0 : get F1
0
P0 : put F1
P0 : put F0
P1 : get F1
P1 : get F2
1
P1 : put F2
P1 : put F1
P2 : get F0
P2 : get F2
2
P2 : put F2
P2 : put F0

real 0m0.075s
user 0m0.001s
sys  0m0.008s
44
…Buggy Philosophers in PThreads

(The putdown code and main() are as in the previous listing; the only difference is that the last thread now also runs Philosopher rather than OddPhilosopher, making all three philosophers symmetric:)

  pthread_create(&tids[NUM_THREADS-1], NULL, Philosopher, (void*)(NUM_THREADS-1));
45
‘Plain run’ of buggy philosopher .. bugs missed by testing
gcc -g -O3 -o buggy examples/Dining3Buggy.c -L ./lib -lpthread -lstdc++ -lssl
% time buggy

P0 : get F0
P0 : get F1
0
P0 : put F1
P0 : put F0
P1 : get F1
P1 : get F2
1
P1 : put F2
P1 : put F1
P2 : get F2
P2 : get F0
2
P2 : put F0
P2 : put F2

real 0m0.084s
user 0m0.002s
sys  0m0.011s
46
Jiggling Schedule in Buggy Philosopher..
(Same includes, globals, and left-fork pickup as in the Dining3.c listing above.)
  permits[i%NUM_THREADS] = 0;
  printf("P%d : get F%d\n", i, i%NUM_THREADS);
  pthread_mutex_unlock(&mutexes[i%NUM_THREADS]);
nanosleep (0) added here
(Right-fork pickup, eating, and the right-fork putdown are as in the listing above.)
47
‘Plain runs’ of buggy philosopher – the bug is still elusive …

gcc -g -O3 -o buggynsleep examples/Dining3BuggyNanosleep0.c -L ./lib -lpthread -lstdc++ -lssl
% buggysleep

P0 : get F0
P0 : sleeping 0 ns
P1 : get F1
P1 : sleeping 0 ns
P2 : get F2
P2 : sleeping 0 ns
P0 : tryget F1
P2 : tryget F0
P1 : tryget F2

% buggysleep

P0 : get F0
P0 : sleeping 0 ns
P0 : get F1
0
P0 : put F1
P0 : put F0
P1 : get F1
P1 : sleeping 0 ns
P2 : get F2
P2 : sleeping 0 ns
P1 : tryget F2
P2 : get F0
2
P2 : put F0
P2 : put F2
P1 : get F2
1
P1 : put F2
P1 : put F1

The first run deadlocked; the second did not.
48
Inspect of nonbuggy and buggy Philosophers ..
./instrument file.c
./compile file.instr.c
./inspect ./target
=== run 1 ===
P0 : get F0
P0 : get F1
0
P0 : put F1
P0 : put F0
P1 : get F1
P1 : get F2
1
P1 : put F2
P1 : put F1
P2 : get F0
P2 : get F2
2
P2 : put F2
P2 : put F0
num of threads = 1
=== run 2 ===
P0 : get F0
...
P1 : put F1
…
=== run 48 ===
P2 : get F0
P2 : get F2
2
P2 : put F2
P2 : put F0
P0 : get F0
P0 : get F1
0
P1 : tryget F1
<< Total number of runs: 48, Transitions explored: 1814
Used time (seconds): 7.999327

For the buggy version:

=== run 1 ===
P0 : get F0
P0 : get F1
0
P0 : put F1
P0 : put F0
P1 : get F1
P1 : get F2
1
P1 : put F2
P1 : put F1
P2 : get F2
P2 : get F0
2
P2 : put F0
P2 : put F2
=== run 2 ===
P0 : get F0
P0 : get F1
0
P0 : put F1
P0 : put F0
P1 : get F1
P1 : get F2
1
P2 : tryget F2
P1 : put F2
P1 : put F1
…
=== run 28 ===
P0 : get F0
P1 : get F1
P0 : tryget F1
P2 : get F2
P1 : tryget F2
P2 : tryget F0
Found a deadlock!!
(0, thread_start) (0, mutex_init, 5) (0, mutex_init, 6) (0, mutex_init, 7)
(0, cond_init, 8) (0, cond_init, 9) (0, cond_init, 10)
(0, obj_write, 2) (0, obj_write, 3) (0, obj_write, 4)
(0, thread_create, 1) (0, thread_create, 2) (0, thread_create, 3)
(1, mutex_lock, 5) (1, obj_read, 2) (1, obj_write, 2) (1, mutex_unlock, 5)
(2, mutex_lock, 6) (2, obj_read, 3) (2, obj_write, 3) (2, mutex_unlock, 6)
(1, mutex_lock, 6) (1, obj_read, 3) (1, mutex_unlock, 6)
(3, mutex_lock, 7) (3, obj_read, 4) (3, obj_write, 4) (3, mutex_unlock, 7)
(2, mutex_lock, 7) (2, obj_read, 4) (2, mutex_unlock, 7)
(3, mutex_lock, 5) (3, obj_read, 2) (3, mutex_unlock, 5)
(-1, unknown)

Total number of runs: 29, killed-in-the-middle runs: 4
Transitions explored: 1193
Used time (seconds): 5.990523
49
The Growth of (n·p)! / (n!)^p for Diningp.c

• Diningp.c has n = 4 (roughly)
• p = 3 : we get 34,650 (a loose upper bound) versus 48 runs with DPOR
• p = 5 : we get 305,540,235,000 versus 2,375 runs with DPOR
• DPOR really works well in reducing the number of interleavings!
• Testing would have to exhibit its cleverness among 3 × 10^11 interleavings
50
On the HUGE importance of DPOR [NEW SLIDE]

BEFORE INSTRUMENTATION:

void * thread_A(void* arg) {
  pthread_mutex_lock(&mutex);
  A_count++;
  pthread_mutex_unlock(&mutex);
}

void * thread_B(void * arg) {
  pthread_mutex_lock(&lock);
  B_count++;
  pthread_mutex_unlock(&lock);
}

AFTER INSTRUMENTATION (transitions are shown as bands):

void *thread_A(void *arg )   // thread_B is similar
{
  void *__retres2 ;
  int __cil_tmp3 ;
  int __cil_tmp4 ;

  inspect_thread_start("thread_A");
  inspect_mutex_lock(& mutex);
  __cil_tmp4 = read_shared_0(& A_count);
  __cil_tmp3 = __cil_tmp4 + 1;
  write_shared_1(& A_count, __cil_tmp3);
  inspect_mutex_unlock(& mutex);
  __retres2 = (void *)0;
  inspect_thread_end();
  return (__retres2);
}
• ONE interleaving with DPOR
• 252 = 10! / (5!)^2 without DPOR
52
Obtaining and Running Inspect (Linux)

 http://www.cs.utah.edu/~yuyang/inspect
 You may need to obtain libssl-dev
 Needs Ocaml-3.10.2 or higher
 Remove the contents of the cache directory autom4te.cache in case “make” loops

 bin/instrument file.c
 bin/compile file.instr.c
 inspect –help
 inspect target
 inspect –s target
53
Examples Included With Tutorial

Dining3Buggy.c : Initial attempt to write 3 Dining Philosophers. Since the code is symmetric, it has a deadlock. Testing misses it.
Dining3BuggyRace1.c: Initial attempt to tweak the code results in read / write race which Inspect finds (testing misses race + deadlock)
Dining3BuggyRace2.c: Another race is now exposed by Inspect
Dining3BuggyNoRace.c: All races removed. Now testing sometimes finds the deadlock. Inspect always finds it.
Dining3.c: This is the final bug-fixed version.
Dining5.c: Without DPOR, this should generate too many states. With DPOR, the number of states / transitions is far fewer.
sharedArrayRace.c: A shared array program with a race.
sharedArray.c: After fixing the race, stateless search does not finish. We need stateful search to finish.
54
Why DPOR? “Classic POR” often runs into trouble because dependencies are often known only at runtime…
Thread 1:                 Thread 2:
if odd(a) then a++        if odd(a) then a++
b++                       b++

#define a A->fld
#define b B->fld
// A == B could be true or false…
55
DPOR helps enumerate all possible “happens-before” partial orders…
void * thread_A(void* arg) {
  pthread_mutex_lock(&mutex);
  A_count++;
  pthread_mutex_unlock(&mutex);
}

void * thread_B(void * arg) {
  pthread_mutex_lock(&lock);
  B_count++;
  pthread_mutex_unlock(&lock);
}

void * thread_C(void * arg) {
  pthread_mutex_lock(&mutex);
  A_count--;
  pthread_mutex_unlock(&mutex);
}

CONSIDER AN EXECUTION:

pthread_mutex_lock(&mutex); A_count++; pthread_mutex_unlock(&mutex);
pthread_mutex_lock(&lock);  B_count++; pthread_mutex_unlock(&lock);
pthread_mutex_lock(&mutex); A_count--; pthread_mutex_unlock(&mutex);
This partial order (“happens before”) determines the outcome of verification!
58
Happens-Before is defined by the Transition Dependency Relation
Two transitions t1 and t2 of a concurrent program are dependent if
– t1 and t2 belong to the same process, OR
– t1 and t2 are concurrently enabled, and t1, t2 are:
  » lock acquire operations on the same lock
  » operations on the same global object, at least one of them a write
  » a WAIT and a SIGNAL on the same condition variable
Introduce an HB edge between every pair of dependent operations in an execution.
59
DPOR helps enumerate all possible “happens-before” partial orders…
First HAPPENS-BEFORE:

pthread_mutex_lock(&mutex); A_count++; pthread_mutex_unlock(&mutex);
pthread_mutex_lock(&lock);  B_count++; pthread_mutex_unlock(&lock);
pthread_mutex_lock(&mutex); A_count--; pthread_mutex_unlock(&mutex);

Another HAPPENS-BEFORE:

pthread_mutex_lock(&mutex); A_count--; pthread_mutex_unlock(&mutex);
pthread_mutex_lock(&lock);  B_count++; pthread_mutex_unlock(&lock);
pthread_mutex_lock(&mutex); A_count++; pthread_mutex_unlock(&mutex);
60
Other details of DPOR
 Happens-before maintained using Vector Clocks
 Two transitions are concurrent if:
 – They are not Happens-Before ordered
 – They can be executed under disjoint lock-sets
 DATA RACE:
 – Two concurrent transitions enabled out of a state
 – Both access the same variable, and one is a write
63
Computation of “ample” sets in Static POR versus in DPOR

Exploring “ample” sets at every state suffices to generate all HB executions.

CLASSICAL POR: the ample set of a state S is determined (conservatively) when S is reached.

DPOR: a dependency observed later in the run helps EXTEND the ample (backtrack) set of the earlier state S!
64
Computation of “ample” sets in Static POR versus in DPOR
Ample determined using “local” criteria:
– Current state
– Next move of the Red process
– Nearest dependent transition looking back
– Add the Red process to the “Backtrack Set” there

This builds the ample set incrementally, based on observed dependencies.
Blue is in the “Done” set.

{ BT }, { Done }
65
Putting it all together …

 We target C/C++ PThread programs
 Instrument the given program (largely automated)
 Run the concurrent program “till the end”
 Compute dependencies based on concrete run information present in the runtime stack
 – This populates the Backtrack Sets: points at which the execution must be replayed
 When an item (a process ID) is explored from the Backtrack Set, put it in the “done” set
 Repeat till all the Backtrack Sets are empty
66
A Simple DPOR Example

Three threads contend for one lock:

t0: lock(t); unlock(t)        t1: lock(t); unlock(t)        t2: lock(t); unlock(t)

Each state on the search stack carries its { BT }, { Done } (backtrack and done) sets.

Run 1: t0: lock, t0: unlock, t1: lock. Since t1’s lock is dependent with t0’s lock, t1 enters the backtrack set of the initial state, whose sets become {t1}, {t0}. The run continues with t1: unlock, t2: lock; t2’s lock is dependent with t1’s, so the state from which t1 moved gets {t2}, {t1}. The run finishes with t2: unlock.

Backtracking pops to the deepest state with a pending backtrack set, replays the prefix t0: lock, t0: unlock, and now schedules t2: lock, t2: unlock first. As further dependencies are observed, the initial state’s sets grow to {t1,t2}, {t0}, and the state after t0’s unlock ends with {}, {t1, t2} once both continuations are done.

Finally the search backtracks to the initial state itself, exploring t1 first (sets {t2}, {t0, t1}) and then t2 first, until every backtrack set is empty.
81
Sequential Model Checking Times
Benchmark   Threads   Runs        Time (sec)
fsbench     26        8,192       291.32
indexer     16        32,768      1188.73
aget        6         113,400     5662.96
bbuf        8         1,938,816   39710.43
82
We have devised a work-distribution scheme (SPIN 2007)

[Figure: a load balancer coordinates worker a and worker b; messages exchanged: idle node id, request unloading, work description, report result]
83
Speedup on aget
84
Speedup on bbuf
85
Stateful Runtime Model Checking (to appear in SPIN 2008)
Method to remember search history (an approximate notion of ‘visited states’ maintained)
Avoiding unsoundness due to cutting off search
Avoiding unsoundness through efficient implementation
86
Recording visited states is hard

 Capturing the stacks of threads and the heap at runtime is difficult.
 – All the state elements shown in this figure must be recorded!
 – A problem not faced by JPF and other “bytecode”-based verification tools
 – Native-code verification tools such as Inspect are inherently harder to build
 Canonizing the heap and comparing pointers among executions are not straightforward.
87
Key observation
The changes between successive local states are often easy to capture.
– It is common that the local state of a thread does not change between successive visible operations: δ-epsilon (no change)
– It is common that a local state change involves only a small number of variables: δ-other (known change)

We can detect visited states across executions by tracking the changes of local states.

Also, Inspect’s replay-based execution avoids the need to capture states such as the process control block:
– Each time, Inspect recreates these states through replay
– Added bonus: ease of parallelization
88
Detecting visited states
S1 (g1, [L1, M1])
  thread 1: δ1  →  S2  (g2,  [L2, M1])
  thread 2: δ2  →  S2′ (g2′, [L1, M2])
From S2,  thread 2: δ2  →  S3 (g3, [L2, M2])
From S2′, thread 1: δ1  →  S3 (g3, [L2, M2])   (a visited state!)

Key idea:
• Local state changes are classified into
  – δ-epsilon (no change)
  – δ-bottom (unknown change)
  – δ-other (known change)
• Uniquely name each non-δ-bottom sequence of each thread
• May miss detecting revisits if δ1 ∘ δ2 = δ2 ∘ δ1
• Cheaply maintains local state info, and often detects revisits

IDs of local states are held in thread-local hash tables.
89
Detecting visited states
S1S1 (g1, [L1,M1])(g1, [L1,M1]) thread 1:
….
thread 1:
….
thread 2:
….
thread 2:
….
90
Detecting visited states
S1S1 (g1, [L1,M1])(g1, [L1,M1]) thread 1:
….
thread 1:
….
thread 2:
….
thread 2:
….
91
Detecting visited states
S1S1
thread 1: δ 1thread 1: δ 1
S2S2
(g1, [L1,M1])(g1, [L1,M1])
(g2, [L2,M1])(g2, [L2,M1])
thread 1:
L1+δ1 L2
…
thread 1:
L1+δ1 L2
…
thread 2:
…
thread 2:
…
92
Detecting visited states
S1 (g1, [L1,M1]) –thread 1: δ1→ S2 (g2, [L2,M1]) –thread 2: δ2→ S3 (g3, [L2,M2])
thread 1: L1 +δ1→ L2, …    thread 2: M1 +δ2→ M2, …
95
Detecting visited states
S1 (g1, [L1,M1]) –thread 1: δ1→ S2 (g2, [L2,M1]) –thread 2: δ2→ S3 (g3, [L2,M2])
S1 –thread 2: δ2→ S2′ (g2′, [L1,M2])
thread 1: L1 +δ1→ L2, …    thread 2: M1 +δ2→ M2, …
96
Detecting visited states
S1 (g1, [L1,M1]) –thread 1: δ1→ S2 (g2, [L2,M1]) –thread 2: δ2→ S3 (g3, [L2,M2])
S1 –thread 2: δ2→ S2′ (g2′, [L1,M2]) –thread 1: δ1→ S3 (g3, [L2,M2]): a visited state!
thread 1: L1 +δ1→ L2, …    thread 2: M1 +δ2→ M2, …
97
Making Stateful DPOR Work Soundly and Efficiently
visited state
Stop here and backtrack?
98
Naïve backtracking does not work!
dependent
visited state
This part will not be traversed if we backtrack naively!
99
A quick fix on this problem
visited state
When a visited state is found, for states in the search stack, add all enabled transitions into the backtrack set
Problem with this fix – redundant backtrack points
100
Our solution
Observation: the number of visible operations that threads can execute is usually small!
Solution: compute a summary of the sub-space using a transition dependency graph
101
Our solution: maintain a visible-operation dependency graph, and fill the backtrack set only according to it…
visited state
Visible operation dependency graph
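A minimal sketch of this idea (our phrasing; names are illustrative, not Inspect's API): when the search hits a visited state, consult a summary of the visible operations reachable below it, and add a backtrack point only at stack entries whose transition is dependent on some operation in that summary.

```python
def backtrack_points(stack_ops, summary_ops, dependent):
    """Indices of search-stack entries that need a backtrack point.

    stack_ops:   visible operations on the current search stack, in order.
    summary_ops: visible operations recorded below the visited state.
    dependent:   predicate over two operations.
    """
    return [i for i, op in enumerate(stack_ops)
            if any(dependent(op, s) for s in summary_ops)]

# Lock operations on the same object are dependent; otherwise independent.
dep = lambda a, b: a[0] == "lock" and b[0] == "lock" and a[1] == b[1]
stack = [("lock", "p"), ("lock", "q"), ("unlock", "p")]
summary = {("lock", "q")}
assert backtrack_points(stack, summary, dep) == [1]   # only the lock on q
```

Compared with the naive fix above, only one backtrack point is added instead of one per enabled transition at every stack state.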
102
Evaluation: two (realistic) benchmarks
– pfscan -- a parallel file scanner
– bzip2smp – a parallel file compressor
103
Evaluation
benchmark   threads   DPOR (runs / transitions / time)     SDPOR (runs / transitions / time)
bzip2smp    4         - / - / -                            4,598 / 26,442 / 1311.15
bzip2smp    5         - / - / -                            18,709 / 92,276 / 9546.34
bzip2smp    6         - / - / -                            51,400 / 236,863 / 25659.4
pfscan      3         84 / 1,157 / 0.53                    71 / 967 / 0.49
pfscan      4         13,617 / 189,218 / 240.74            3,168 / 40,395 / 57.43
pfscan      5         - / - / -                            272,873 / 3,402,486 / 5328.84
104
RESULTS pertaining to ISP (to be presented at CAV 2008)
105
ISP is an entirely separate project… http://www.cs.utah.edu/formal_verification/ISP_Tests
(Figure: operation of ISP)
MPI Program → simplifications → Simplified MPI Program → compile → executable
The executable runs as Proc 1 … Proc n against the actual MPI library and runtime
A scheduler intercepts PMPI calls and controls each process via request/permit
106
MPI Program Verification Work prior to ISP
Siegel and Avrunin, Siegel – MPI programs modeled in Promela
» Models built by hand
– MPI-SPIN employs some “MPI-aware” reductions
» A version of SPIN with C functions serving to capture MPI
– Symbolic execution to compare sequential and concurrent algorithms
» Three precision levels of comparison
Efforts based on static analysis
– Vuduc, Quinlan, de Supinski, Dwyer, Hoveland, …
» The usual attributes of static analysis apply
Dynamic execution based verification is ESSENTIAL for MPI
– Need to examine code paths in MPI and user-level libraries
– Bugs may be in the “surrounding code”
– Dynamic reductions can dramatically reduce the number of interleavings
– MPI programs often compute many things (communicators, send targets, …)
107
Summary of ISP
Dynamic verification of MPI programs suffers from
– the inability to externally control how the MPI runtime performs message matches for wildcard receive statements
– the inability to force a desired alternative execution
ISP overcomes these problems by
– Exploiting MPI’s out-of-order semantics
» If the runtime does not have to follow certain program orderings, then one can afford to postpone the issue of certain MPI operations
» Such postponement allows one to determine the maximal set of message sends that can ever match a receive
» Later, we show that this is also the basis of a barrier-removal algorithm
– An execution strategy that guarantees AMPLE sets at every point
Dealing with actual concurrent-program runtimes in a DPOR approach is a growing reality to be confronted
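The postponement idea can be sketched in a few lines (our simplification, not ISP's code): a wildcard receive is not handed to the runtime until every process is at a fence, at which point the sends collected so far form the maximal set of possible matches, each of which is then explored.

```python
def maximal_matches(pending_sends, recv_src):
    """Sends that can match a postponed receive once all processes are fenced.

    pending_sends: list of (source rank, payload) already issued toward us.
    recv_src:      a concrete rank, or "*" for MPI_ANY_SOURCE.
    """
    if recv_src == "*":
        return list(pending_sends)            # every pending send can match
    return [s for s in pending_sends if s[0] == recv_src]

sends = [(0, 22), (2, 33)]
assert maximal_matches(sends, "*") == [(0, 22), (2, 33)]   # two cases to explore
assert maximal_matches(sends, 2) == [(2, 33)]              # deterministic match
```

Issuing the receive eagerly would let the runtime commit to one match and silently hide the other.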
108
ISP Results
Found deadlocks missed by some other existing tools
– testing tools such as Marmot, and MPICH run from the terminal
Could finish examining some large benchmarks
– Game of Life example (500+ lines of code)
– “Lines of code” is an unfamiliar metric for MPI
» even 4 lines of code can be “hard”
This level of coverage is unattainable through “testing” in the presence of
– non-determinism
» it is never known statically whether code is deterministic
– too many processes
» blind interleaving can kill any testing strategy
Full table of results presented at http://www.cs.utah.edu/formal_verification/ISP_Tests
109
How testing can miss error1
P0---
MPI_Send(to P1…);
MPI_Send(to P1, data=22);
P1---
MPI_Recv(from P0…);
MPI_Recv(from P2…);
MPI_Recv(*, x);
if (x==22) then error1 else MPI_Recv(*, x);
P2---
MPI_Send(to P1…);
MPI_Send(to P1, data=33);
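The example can be compressed into a few lines (our simplification of the program above): after P1's two specific receives consume the first sends from P0 and P2, the wildcard MPI_Recv(*, x) has two pending matches, data=22 from P0 and data=33 from P2. A native test run commits to whichever the runtime happens to pick; enumerating both choices is what exposes error1 deterministically.

```python
def p1_third_recv(match):
    """P1's branch after the wildcard receive binds x."""
    source, x = match
    return "error1" if x == 22 else "ok"

pending = [(0, 22), (2, 33)]     # (sender rank, payload) still in flight to P1
outcomes = [p1_third_recv(m) for m in pending]
assert sorted(outcomes) == ["error1", "ok"]   # one schedule hides the bug
```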
114
How testing can miss error1 (5)
P0---
MPI_Send(to P1…);
MPI_Send(to P1, data=22);
P1---
MPI_Recv(from P0…);
MPI_Recv(from P2…);
MPI_Recv(*, x);
if (x==22) then error1 else MPI_Recv(*, x);
P2---
MPI_Send(to P1…);
MPI_Send(to P1, data=33);
unlucky
115
How testing can miss error1 (6)
P0---
MPI_Send(to P1…);
MPI_Send(to P1, data=22);
P1---
MPI_Recv(from P0…);
MPI_Recv(from P2…);
MPI_Recv(*, x);
if (x==22) then error1 else MPI_Recv(*, x);
P2---
MPI_Send(to P1…);
MPI_Send(to P1, data=33);
lucky
116
How ISP efficiently catches error1
P0---
MPI_Send(to P1…);
MPI_Send(to P1, data=22);
P1---
MPI_Recv(from P0…);
MPI_Recv(from P2…);
MPI_Recv(*, x);
if (x==22) then error1 else MPI_Recv(*, x);
P2---
MPI_Send(to P1…);
MPI_Send(to P1, data=33);
Avoid unnecessary interleavings here, thanks to POR
117
How ISP efficiently catches error1
P0---
MPI_Send(to P1…);
MPI_Send(to P1, data=22);
P1---
MPI_Recv(from P0…);
MPI_Recv(from P2…);
MPI_Recv(*, x);
if (x==22) then error1 else MPI_Recv(*, x);
P2---
MPI_Send(to P1…);
MPI_Send(to P1, data=33);
Avoid unnecessary interleavings here, thanks to POR
Consider both matches here, thanks to dynamic rewrite and recursive expansion
118
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(*, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
121
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(*, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
Is a fence, hence switch over to P1…
123
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(*, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
Is a fence, hence switch over to P2…
124
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(*, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
Likewise, do it for P2 (multiple steps shown here)…
125
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(*, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
All processes have hit a fence
Now form Match Sets - you’ll see which are the match sets by those actions turning red!
Issue Match Sets in priority order
126
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(*, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
First priority match-sets are Barriers
Their ancestors have fired (they have no ancestors!)
SO WE CAN LET THEM FIRE
127
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(*, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
Now the processes are no longer at a fence, so the execution advances each process to its next fence – multiple such steps are shown here…
128
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(*, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
Now, the only eligible ops whose ancestors have fired are shown in blue
129
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(*, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
They also happen to be the ones needing a dynamic rewrite into specific receives… we will show TWO cases now and then pursue one of the cases.
130
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(from 1, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
Case 1
131
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(from 2, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
Case 2
132
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(from 1, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
Pursuing Case 1, we get this…
133
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(from 1, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
And then this…
136
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(from 1, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
And finally this!
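The POE loop illustrated above can be compressed into a sketch (ours, not ISP's code): run each process to its next fence, then form match sets – all barriers fire first, and a postponed wildcard receive yields one match set per eligible sender, each replayed in turn.

```python
def match_sets(at_fence_ops):
    """Match sets from the operations collected once all processes are fenced.

    at_fence_ops: list of (rank, kind) with kind in
    {"barrier", "irecv*", "isend"}; names are our own shorthand.
    """
    barriers = [op for op in at_fence_ops if op[1] == "barrier"]
    if barriers:                               # barriers have highest priority
        return [barriers]
    recvs = [op for op in at_fence_ops if op[1] == "irecv*"]
    sends = [op for op in at_fence_ops if op[1] == "isend"]
    # dynamic rewrite: one match set per sender the wildcard could take
    return [[r, s] for r in recvs for s in sends]

ops = [(0, "irecv*"), (1, "isend"), (2, "isend")]
assert match_sets(ops) == [[(0, "irecv*"), (1, "isend")],
                           [(0, "irecv*"), (2, "isend")]]     # Cases 1 and 2
assert match_sets([(0, "barrier"), (1, "barrier")]) == \
    [[(0, "barrier"), (1, "barrier")]]
```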
137
RESULTS pertaining to FIB
138
Summary of FIB
MPI_Barrier() calls within an MPI program are employed to constrain executions
– there are always more executions without a barrier than with one
Barrier uses and desired checks
– Used to streamline I/O
» the barrier had better be FUNCTIONALLY IRRELEVANT
– To prevent certain message matches from occurring
» such barriers had better be FUNCTIONALLY RELEVANT
Hitherto there was no algorithm to verify (for all program inputs) whether a barrier is an FIB or an FRB
We offer an algorithm (called “Fib”) that finds all FIBs for a given input
Through static analysis, we can sometimes extrapolate this result to cover ALL possible inputs
– for MPI programs that do not have “data dependent control flows”
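The relevance criterion developed below can be sketched as a graph check (our rendering of the idea; node names are illustrative): take a wildcard receive R and a potential sender S that did not match it, and ask whether R and S are ordered in the IntraCB/InterCB graph only through a barrier. If every ordering path goes through a barrier, that barrier is functionally relevant – removing it would enable the alternative match.

```python
def path_avoiding_barriers(graph, src, dst):
    """Is dst reachable from src without traversing a barrier node?"""
    stack, seen = [src], set()
    while stack:
        n = stack.pop()
        if n == dst:
            return True
        if n in seen or "Barrier" in n:   # barrier nodes block the path
            continue
        seen.add(n)
        stack.extend(graph.get(n, ()))
    return False

def barrier_is_relevant(graph, recv, send):
    # ordered, but only via barrier nodes => the barrier does the ordering
    return not path_avoiding_barriers(graph, recv, send)

# Running example: P0's Irecv reaches P2's Isend only through the barriers.
g = {"P0:Irecv": ["P0:Wait"], "P0:Wait": ["P0:Barrier"],
     "P0:Barrier": ["P2:Isend"]}
assert barrier_is_relevant(g, "P0:Irecv", "P2:Isend")
```

As the slides stress, to flag a barrier irrelevant (FIB) this must hold over all POE-reduced interleavings, not just one.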
139
Fib Overview – is this barrier relevant?
P0---
MPI_Irecv(*, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
140
IntraCB Edges (how much program order is maintained in executions)
P0---
MPI_Irecv(*, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
141
IntraCB (implied transitivity)
P0---
MPI_Irecv(*, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
142
InterCB introduction: for any x,y in a match set, add InterCB from x to every IntraCB successor of y
P0---
MPI_Irecv(*, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
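The InterCB rule stated in the title is mechanical enough to sketch directly (our code, with names from the running example treated as plain strings):

```python
def add_inter_cb(match_set, intra_succ):
    """InterCB edges induced by one match set.

    For any two operations x, y in the match set, add an edge from x to
    every IntraCB (program-order) successor of y. intra_succ maps an
    operation name to the list of its IntraCB successors.
    """
    edges = set()
    for x in match_set:
        for y in match_set:
            if x == y:
                continue
            for z in intra_succ.get(y, ()):
                edges.add((x, z))
    return edges

# Match set {P0:Irecv, P1:Isend}; P0's Wait follows its Irecv, and P1's
# Barrier follows its Isend.
succ = {"P0:Irecv": ["P0:Wait"], "P1:Isend": ["P1:Barrier"]}
edges = add_inter_cb(["P0:Irecv", "P1:Isend"], succ)
assert ("P0:Irecv", "P1:Barrier") in edges
assert ("P1:Isend", "P0:Wait") in edges
```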
143
InterCB introduction: for any x,y in a match set, add InterCB from x to every IntraCB successor of y
P0---
MPI_Irecv(from 1, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
Match set formedduring POE
144
InterCB introduction: for any x,y in a match set, add InterCB from x to every IntraCB successor of y
P0---
MPI_Irecv(from 1, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
Match set formedduring POE
InterCB
145
InterCB introduction: for any x,y in a match set, add InterCB from x to every IntraCB successor of y
P0---
MPI_Irecv(from 1, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
InterCB
146
InterCB introduction: for any x,y in a match set, add InterCB from x to every IntraCB successor of y
P0---
MPI_Irecv(from 1, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
InterCB
InterCB
147
Continue adding InterCBs as the execution advances. Here, we pick the Barriers to be the next match set…
P0---
MPI_Irecv(from 1, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
InterCB
InterCB
149
… newly added InterCBs (only some of them shown…)
P0---
MPI_Irecv(from 1, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
InterCB
InterCB
InterCB
InterCB
150
Now the question pertains to the receive that was a wild-card, and a potential sender that could have matched it…
P0---
MPI_Irecv(was *, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
InterCB
InterCB
InterCB
InterCB
151
If they are ordered by a Barrier and NO OTHER OPERATION, then the Barrier is RELEVANT…
P0---
MPI_Irecv(was *, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
InterCB
InterCB
InterCB
InterCB
155
In this example, the Barrier is relevant !!
P0---
MPI_Irecv(was *, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
InterCB
InterCB
InterCB
InterCB
156
To flag a barrier as FIB (irrelevant), it has to remain irrelevant over all POE-reduced interleavings!!
P0---
MPI_Irecv(was *, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
InterCB
InterCB
InterCB
InterCB
157
Summary of Fib
Algorithm has been implemented within the ISP tool
Very low overhead (so can keep it turned ‘on’ always…)
Identified FIBs and FRBs in many examples – manually checked correctness of identification
Simple static analysis facility has helped extend this claim over ALL external drivers
Future work : extend reach of this static analysis method to be able to claim FIB over all possible inputs
158
Concluding Remarks
We are “sold” on the merits of dynamic analysis
– Gives a sense of realism
– Can give designers “debugger-like” interfaces
» yet “verifier-like” coverages
Going Forward
– Bug-preserving scaling methods are essential to develop
» For MPI, OpenMP, …
– Collaboration between API designers and verification tool builders
» Make APIs easier to use in a “verification mode”
» Keep “verification mode” and “execution mode” semantics in agreement
» The plethora of concurrency APIs seems to require early attention to such a “verification mode API”
– Static and dynamic analysis can work in synergy
159
Extra slides
160
Looking Further Ahead: need to clear the “idea log-jam” in multi-core computing…
“There isn’t such a thing as Republican clean air or Democratic clean air. We all breathe the same air.”
There isn’t such a thing as an architecture-only solution, or a compilers-only solution, to future problems in multi-core computing…
161
Now you see it; now you don’t!
On the menace of non-reproducible bugs.
Deterministic replay must ideally be an option
User-programmable schedulers are greatly emphasized by expert developers
Runtime model-checking methods with state-space reduction hold promise in meshing with current practice…
162
Computing Ample Sets (basic idea)
Ideal situation:
– No path via the green triangle will ever “wake up” the disabled red transitions
» If this can be established, we can avoid interleaving the greens with the blues…
If dependence cannot be precisely computed, we will interleave the greens with the blues – too many such interleavings!
(Figure: the transitions going out of a state S, belonging to different processes, are divided into three groups – some arbitrary transition t and its dependency closure; the disabled dependents (wrt t); and the enabled independents (wrt t).)
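The grouping just described can be sketched as a fixpoint over the enabled transitions (a standard POR idea in our own phrasing; the dependence predicate here is a stand-in for a real one):

```python
def ample_candidate(enabled, t, dependent):
    """Dependency closure of transition t within the enabled transitions.

    The transitions in the returned set are the candidate ample set; the
    remaining enabled transitions are the independents that may be
    postponed, provided nothing reachable through them wakes up the
    disabled dependents.
    """
    amp = {t}
    changed = True
    while changed:
        changed = False
        for u in enabled:
            if u not in amp and any(dependent(u, v) for v in amp):
                amp.add(u)
                changed = True
    return amp

# Toy dependence: transitions touching the same object are dependent.
dep = lambda a, b: a[1] == b[1]
enabled = {("t", "x"), ("u", "x"), ("v", "y")}
assert ample_candidate(enabled, ("t", "x"), dep) == {("t", "x"), ("u", "x")}
```

When `dependent` is imprecise, the closure grows toward the full enabled set and the reduction degrades, which is the "too many interleavings" case noted above.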