Inspect, ISP, and FIB: reduction-based verification and analysis tools
for concurrent programs
Research Group: Yu Yang, Xiaofang Chen, Sarvani Vakkalanka, Subodh Sharma, Anh Vo, Michael DeLisi, Geof Sawaya
Faculty: Ganesh Gopalakrishnan (speaker) and Robert M. Kirby
School of Computing, University of Utah, Salt Lake City, UT
[email protected] http://www.cs.utah.edu/formal_verification
Talk at MSR India, Bangalore Research Labs, June 6, 2008
Supported by Microsoft HPC Center Grant, NSF CNS-0509379, SRC TJ 1318
2
Multicores are the future! Need to employ / teach concurrent programming at an unprecedented scale!
(photo courtesy of Intel Corporation)
Some of today’s proposals:
Threads (various)
Message Passing (various)
Transactional Memory (various)
OpenMP
MPI
Intel’s Ct
Microsoft’s Parallel Fx
Cilk Arts’s Cilk
Intel’s TBB
Nvidia’s Cuda
…
4
Q: What tool feature is desired for any concurrency approach?
Threads
Message Passing
Transactional Memory
OpenMP
MPI
Ct
Parallel Fx
Cilk
TBB
Cuda
…
A: The ability to verify over all RELEVANT interleavings!
Will have different “grain” sizes (notions of atomicity)
Different types of interactions between “threads / processes”
Different kinds of bugs
-- deadlocks
-- data races
-- communication races
-- memory leaks
Yet, the basic goal of verification remains the same: achieving the effect of having examined all possible interleavings while exploring only representative interleavings
5
An exponential number of interleavings… We need sound criteria for reductions (e.g., partial order reduction, POR).
[Figure: Card Deck 0 and Card Deck 1, each holding cards 0–5]
• Suppose only the interleavings of the red cards matter
• Then don’t try all 12! / (6! · 6!) = 924 riffle-shuffles
• Just do TWO shuffles !!
6
The Growth of (n·p)! / (n!)^p

[Figure: p threads (Thread 1 … Thread p), each with steps 1 … n]

• Unity / Murphi “guard / action” rules: n = 1, p = R gives R! interleavings
• p = 3, n = 5 gives about 10^6 interleavings
• p = 3, n = 6 gives about 17 × 10^6 interleavings
• p = 4, n = 5 gives about 10^10 interleavings
The situation is worse, because each statement (card) produces different state transformations…
8
Ad-hoc Testing is INEFFECTIVE for thread verification !
[Figure: p threads (Thread 1 … Thread p), each with steps 1 … n]
Need Sound and Practically Justifiable Reduction Techniques !!
9
Growing Need to Verify Real-world Concurrent Programs
One often discovers what one is doing only through programming!
– Need a safety net!
‘Correct by construction’ methods work only around stable ideas
– Multicore programming is not there yet
Tools are needed for TODAY’s problems
– Inspect is one such tool
10
The need for dynamic verification: Too many realities – e.g. the code eliminated may have the bug
#include <stdlib.h>      // Dining Philosophers with no deadlock
#include <pthread.h>     // all phils but "odd" one pick up their
#include <stdio.h>       // left fork first; odd phil picks
#include <string.h>      // up right fork first
#include <malloc.h>
#include <errno.h>
#include <sys/types.h>
#include <assert.h>

#define NUM_THREADS 3

pthread_mutex_t mutexes[NUM_THREADS];
pthread_cond_t conditionVars[NUM_THREADS];
int permits[NUM_THREADS];
pthread_t tids[NUM_THREADS];

int data = 0;

void * Philosopher(void * arg) {
  int i;
  i = (int)arg;

  // pickup left fork
  pthread_mutex_lock(&mutexes[i%NUM_THREADS]);
  while (permits[i%NUM_THREADS] == 0) {
    printf("P%d : tryget F%d\n", i, i%NUM_THREADS);
    pthread_cond_wait(&conditionVars[i%NUM_THREADS], &mutexes[i%NUM_THREADS]);
  }
11
Threads are too low level – Ed Lee – “The Problem with Threads”
» Complex global effects
» Semantics are not compositional
» Non-robustness against failure
Yet, alternative proposals are in a state of flux
– OpenMP
– Transactional Memory
– …
Will need threads to IMPLEMENT alternate proposals!
Need to support dynamic verification of thread programs
12
Each parallel programming API class / approach seems to warrant its own dynamic verification approach
May be possible to employ common instrumentation / replay mechanisms in the long run
As of now, one has to build customized implementations
– We have implementations for Threads (Inspect) and MPI (ISP)
– We are building an implementation for OpenMP (based on a backtrackable version of OMPi from Greece)
Speaking in general
13
(BlueGene/L - Image courtesy of IBM / LLNL) (Image courtesy of Steve Parker, CSAFE, Utah)
Reason for our interest in the Message Passing Interface
• MPI is the de facto standard for programming clusters
A large API with over 300 functions, widely supported
A custom-made dynamic verification approach is needed
The need for FV solutions is acutely felt in this area…
14
The success of MPI over many apps (Courtesy of Al Geist, EuroPVM / MPI 2007)
15
MPI is complex …
– Send
– Receive
– Send / Receive
– Send / Receive / Replace
– Broadcast
– Barrier
– Reduce
– Rendezvous mode
– Blocking mode
– Non-blocking mode
– Reliance on system buffering
– User-attached buffering
– Restarts/Cancels of MPI Operations
– Non Wildcard receives
– Wildcard receives
– Tag matching
– Communication spaces
An MPI program is an interesting (and legal) combination of elements from these spaces
Yet, the complexity seems unavoidable to succeed at the scale of MPI’s deployment…
We have defined a formal semantics for 150 of MPI's 300+ functions in TLA+ (soon to try other notations, e.g. SAL)
18
Our approach with MPI: go after “low-hanging bugs”

Automated verification of common mistakes:
– Deadlocks
– Communication Races
– Resource Leaks
19
Deadlock pattern…
P0        P1
---       ---
s(P1);    s(P0);
r(P1);    r(P0);
P0        P1
---       ---
Bcast;    Barrier;
Barrier;  Bcast;
20
Communication Race Pattern…
P0        P1        P2
---       ---       ---
r(*);     s(P0);    s(P0);
r(P1);

OK: r(*) matched P2’s send, so r(P1) finds P1’s send

P0        P1        P2
---       ---       ---
r(*);     s(P0);    s(P0);
r(P1);

NOK: r(*) matched P1’s send, so r(P1) never finds a matching send
21
Resource Leak Pattern…
P0
---
some_allocation_op(&handle);
FORGOTTEN DEALLOC !!
22
Why is even this much debugging hard?
The “crooked barrier” quiz will show you why…
P0                  P1                  P2
---                 ---                 ---
MPI_Isend( P2 )     MPI_Barrier         MPI_Irecv( ANY )
MPI_Barrier         MPI_Isend( P2 )     MPI_Barrier
Will P1’s Send Match P2’s Receive ?
23
P0                  P1                  P2
---                 ---                 ---
MPI_Isend( P2 )     MPI_Barrier         MPI_Irecv( ANY )
MPI_Barrier         MPI_Isend( P2 )     MPI_Barrier
It will! A nonblocking MPI_Irecv stays pending across the barrier, so P1’s post-barrier send can still match it.
MPI Behavior
The “crooked barrier” quiz
We need a dynamic verification approach to be aware of the details of the API behavior…
30
Another motivating example: we should not multiply out the interleavings of P0–P2 against those of P3–P5
P0                  P1                  P2
---                 ---                 ---
MPI_Isend( P2 )     MPI_Barrier         MPI_Irecv( * )
MPI_Barrier         MPI_Isend( P2 )     MPI_Barrier

P3                  P4                  P5
---                 ---                 ---
MPI_Isend( P5 )     MPI_Barrier         MPI_Irecv( * )
MPI_Barrier         MPI_Isend( P5 )     MPI_Barrier
31
Results pertaining to Inspect (to be presented at SPIN 2008)
32
Inspect Workflow
Multithreaded C/C++ program
   → instrumentation →
instrumented program (+ thread library wrapper)
   → compile →
executable
   → run: thread 1 … thread n  ↔  scheduler (request / permit)
33
Overview
Scheduler: DPOR engine, state stack, message buffer
   ↕ action request / permission (Unix domain sockets)
Visible operation interceptor
Program under test (threads)
34
Message Types
 Thread creation/termination messages
 Visible operation requests
 – acquire/release locks
 – wait for/send signals
 – read/write shared objects
 Other helper messages
 – local state changes
 – ...
35
Overview of the source transformation
Multithreaded C Program
   → Inter-procedural Flow-sensitive Alias Analysis
   → Thread Escape Analysis
   → Intra-procedural Dataflow Analysis
   → Source code transformation
   → Instrumented Program
36
Source code transformation (1)
Function calls to the thread library routines become calls to the Inspect library wrappers. In detail:

pthread_create      →  inspect_thread_create
pthread_mutex_lock  →  inspect_mutex_lock
…
37
Source code transformation (2)
x = rhs;     becomes     write_shared_xxx(&x, rhs);

void write_shared_xxx(type * addr, type val) {
  inspect_obj_write(addr);
  *addr = val;
}

lhs = x;     becomes     read_shared_xxx(&lhs, &x);

void read_shared_xxx(type * lhs, type * addr) {
  inspect_obj_read(addr);
  *lhs = *addr;
}
38
Source Transformation (3)
thread_routine(…) {
  …
}

becomes

thread_routine(…) {
  inspect_thread_begin();
  …
  inspect_thread_end();
}
39
Source Transformation (4)
visible operation 1
…
visible operation 2

becomes

visible operation 1
…
inspect_local_changes(…)
visible operation 2
40
Result of instrumentation
Before:

void * Philosopher(void * arg) {
  int i;
  i = (int)arg;
  ...
  pthread_mutex_lock(&mutexes[i%3]);
  ...
  while (permits[i%3] == 0) {
    printf("P%d : tryget F%d\n", i, i%3);
    pthread_cond_wait(...);
  }
  ...
  permits[i%3] = 0;
  ...
  pthread_cond_signal(&conditionVars[i%3]);
  pthread_mutex_unlock(&mutexes[i%3]);
  return NULL;
}

After:

void *Philosopher(void *arg ) {
  int i ;
  pthread_mutex_t *tmp ;
  inspect_thread_start("Philosopher");
  i = (int )arg;
  tmp = & mutexes[i % 3];
  …
  inspect_mutex_lock(tmp);
  …
  while (1) {
    __cil_tmp43 = read_shared_0(& permits[i % 3]);
    if (! __cil_tmp32) { break; }
    __cil_tmp33 = i % 3;
    …
    tmp___0 = __cil_tmp33;
    …
    inspect_cond_wait(...);
  }
  ...
  write_shared_1(& permits[i % 3], 0);
  ...
  inspect_cond_signal(tmp___25);
  ...
  inspect_mutex_unlock(tmp___26);
  ...
  inspect_thread_end();
  return (__retres31);
}
41
Philosophers in PThreads…
#include <stdlib.h>      // Dining Philosophers with no deadlock
#include <pthread.h>     // all phils but "odd" one pick up their
#include <stdio.h>       // left fork first; odd phil picks
#include <string.h>      // up right fork first
#include <malloc.h>
#include <errno.h>
#include <sys/types.h>
#include <assert.h>

#define NUM_THREADS 3

pthread_mutex_t mutexes[NUM_THREADS];
pthread_cond_t conditionVars[NUM_THREADS];
int permits[NUM_THREADS];
pthread_t tids[NUM_THREADS];

int data = 0;

void * Philosopher(void * arg) {
  int i;
  i = (int)arg;

  // pickup left fork
  pthread_mutex_lock(&mutexes[i%NUM_THREADS]);
  while (permits[i%NUM_THREADS] == 0) {
    printf("P%d : tryget F%d\n", i, i%NUM_THREADS);
    pthread_cond_wait(&conditionVars[i%NUM_THREADS], &mutexes[i%NUM_THREADS]);
  }
  permits[i%NUM_THREADS] = 0;
  printf("P%d : get F%d\n", i, i%NUM_THREADS);
  pthread_mutex_unlock(&mutexes[i%NUM_THREADS]);

  // pickup right fork
  pthread_mutex_lock(&mutexes[(i+1)%NUM_THREADS]);
  while (permits[(i+1)%NUM_THREADS] == 0) {
    printf("P%d : tryget F%d\n", i, (i+1)%NUM_THREADS);
    pthread_cond_wait(&conditionVars[(i+1)%NUM_THREADS], &mutexes[(i+1)%NUM_THREADS]);
  }
  permits[(i+1)%NUM_THREADS] = 0;
  printf("P%d : get F%d\n", i, (i+1)%NUM_THREADS);
  pthread_mutex_unlock(&mutexes[(i+1)%NUM_THREADS]);

  //printf("philosopher %d thinks \n",i);
  printf("%d\n", i);
  // data = 10 * data + i;
  fflush(stdout);

  // putdown right fork
  pthread_mutex_lock(&mutexes[(i+1)%NUM_THREADS]);
  permits[(i+1)%NUM_THREADS] = 1;
  printf("P%d : put F%d\n", i, (i+1)%NUM_THREADS);
  pthread_cond_signal(&conditionVars[(i+1)%NUM_THREADS]);
  pthread_mutex_unlock(&mutexes[(i+1)%NUM_THREADS]);
42
…Philosophers in PThreads

  // putdown left fork
  pthread_mutex_lock(&mutexes[i%NUM_THREADS]);
  permits[i%NUM_THREADS] = 1;
  printf("P%d : put F%d \n", i, i%NUM_THREADS);
  pthread_cond_signal(&conditionVars[i%NUM_THREADS]);
  pthread_mutex_unlock(&mutexes[i%NUM_THREADS]);

  // putdown right fork
  pthread_mutex_lock(&mutexes[(i+1)%NUM_THREADS]);
  permits[(i+1)%NUM_THREADS] = 1;
  printf("P%d : put F%d \n", i, (i+1)%NUM_THREADS);
  pthread_cond_signal(&conditionVars[(i+1)%NUM_THREADS]);
  pthread_mutex_unlock(&mutexes[(i+1)%NUM_THREADS]);

  return NULL;
}

int main() {
  int i;

  for (i = 0; i < NUM_THREADS; i++)
    pthread_mutex_init(&mutexes[i], NULL);
  for (i = 0; i < NUM_THREADS; i++)
    pthread_cond_init(&conditionVars[i], NULL);
  for (i = 0; i < NUM_THREADS; i++)
    permits[i] = 1;

  for (i = 0; i < NUM_THREADS-1; i++) {
    pthread_create(&tids[i], NULL, Philosopher, (void*)(i));
  }
  pthread_create(&tids[NUM_THREADS-1], NULL, OddPhilosopher, (void*)(NUM_THREADS-1));

  for (i = 0; i < NUM_THREADS; i++) {
    pthread_join(tids[i], NULL);
  }

  for (i = 0; i < NUM_THREADS; i++) {
    pthread_mutex_destroy(&mutexes[i]);
  }
  for (i = 0; i < NUM_THREADS; i++) {
    pthread_cond_destroy(&conditionVars[i]);
  }

  //printf(" data = %d \n", data);
  //assert( data != 201);
  return 0;
}
43
‘Plain run’ of Philosophers
gcc -g -O3 -o nobug examples/Dining3.c -L ./lib -lpthread -lstdc++ -lssl
% time nobug

P0 : get F0
P0 : get F1
0
P0 : put F1
P0 : put F0
P1 : get F1
P1 : get F2
1
P1 : put F2
P1 : put F1
P2 : get F0
P2 : get F2
2
P2 : put F2
P2 : put F0

real 0m0.075s
user 0m0.001s
sys  0m0.008s
44
…Buggy Philosophers in PThreads

(The putdown code and main() are as in the previous listing; the only difference is that the last thread now also runs Philosopher rather than OddPhilosopher, making all three philosophers symmetric:)

  pthread_create(&tids[NUM_THREADS-1], NULL, Philosopher, (void*)(NUM_THREADS-1));
45
‘Plain run’ of buggy philosopher .. bugs missed by testing
gcc -g -O3 -o buggy examples/Dining3Buggy.c -L ./lib -lpthread -lstdc++ -lssl
% time buggy

P0 : get F0
P0 : get F1
0
P0 : put F1
P0 : put F0
P1 : get F1
P1 : get F2
1
P1 : put F2
P1 : put F1
P2 : get F2
P2 : get F0
2
P2 : put F0
P2 : put F2

real 0m0.084s
user 0m0.002s
sys  0m0.011s
46
Jiggling Schedule in Buggy Philosopher..
(Same includes, globals, and left-fork pickup as in the Dining3.c listing above.)
  permits[i%NUM_THREADS] = 0;
  printf("P%d : get F%d\n", i, i%NUM_THREADS);
  pthread_mutex_unlock(&mutexes[i%NUM_THREADS]);
nanosleep (0) added here
(Right-fork pickup, eating, and the right-fork putdown are as in the listing above.)
47
‘Plain runs’ of buggy philosopher – the bug is still elusive …

gcc -g -O3 -o buggynsleep examples/Dining3BuggyNanosleep0.c -L ./lib -lpthread -lstdc++ -lssl
% buggysleep

P0 : get F0
P0 : sleeping 0 ns
P1 : get F1
P1 : sleeping 0 ns
P2 : get F2
P2 : sleeping 0 ns
P0 : tryget F1
P2 : tryget F0
P1 : tryget F2

% buggysleep

P0 : get F0
P0 : sleeping 0 ns
P0 : get F1
0
P0 : put F1
P0 : put F0
P1 : get F1
P1 : sleeping 0 ns
P2 : get F2
P2 : sleeping 0 ns
P1 : tryget F2
P2 : get F0
2
P2 : put F0
P2 : put F2
P1 : get F2
1
P1 : put F2
P1 : put F1

The first run deadlocked; the second did not.
48
Inspect of nonbuggy and buggy Philosophers ..
./instrument file.c
./compile file.instr.c
./inspect ./target
=== run 1 ===
P0 : get F0
P0 : get F1
0
P0 : put F1
P0 : put F0
P1 : get F1
P1 : get F2
1
P1 : put F2
P1 : put F1
P2 : get F0
P2 : get F2
2
P2 : put F2
P2 : put F0
num of threads = 1
=== run 2 ===
P0 : get F0
...
P1 : put F1
…
=== run 48 ===
P2 : get F0
P2 : get F2
2
P2 : put F2
P2 : put F0
P0 : get F0
P0 : get F1
0
P1 : tryget F1
<< Total number of runs: 48, Transitions explored: 1814
Used time (seconds): 7.999327

For the buggy version:

=== run 1 ===
P0 : get F0
P0 : get F1
0
P0 : put F1
P0 : put F0
P1 : get F1
P1 : get F2
1
P1 : put F2
P1 : put F1
P2 : get F2
P2 : get F0
2
P2 : put F0
P2 : put F2
=== run 2 ===
P0 : get F0
P0 : get F1
0
P0 : put F1
P0 : put F0
P1 : get F1
P1 : get F2
1
P2 : tryget F2
P1 : put F2
P1 : put F1
…
=== run 28 ===
P0 : get F0
P1 : get F1
P0 : tryget F1
P2 : get F2
P1 : tryget F2
P2 : tryget F0
Found a deadlock!!
(0, thread_start) (0, mutex_init, 5) (0, mutex_init, 6) (0, mutex_init, 7)
(0, cond_init, 8) (0, cond_init, 9) (0, cond_init, 10)
(0, obj_write, 2) (0, obj_write, 3) (0, obj_write, 4)
(0, thread_create, 1) (0, thread_create, 2) (0, thread_create, 3)
(1, mutex_lock, 5) (1, obj_read, 2) (1, obj_write, 2) (1, mutex_unlock, 5)
(2, mutex_lock, 6) (2, obj_read, 3) (2, obj_write, 3) (2, mutex_unlock, 6)
(1, mutex_lock, 6) (1, obj_read, 3) (1, mutex_unlock, 6)
(3, mutex_lock, 7) (3, obj_read, 4) (3, obj_write, 4) (3, mutex_unlock, 7)
(2, mutex_lock, 7) (2, obj_read, 4) (2, mutex_unlock, 7)
(3, mutex_lock, 5) (3, obj_read, 2) (3, mutex_unlock, 5)
(-1, unknown)

Total number of runs: 29, killed-in-the-middle runs: 4
Transitions explored: 1193
Used time (seconds): 5.990523
49
The Growth of (n·p)! / (n!)^p for Diningp.c

• Diningp.c has n = 4 (roughly)
• p = 3 : we get 34,650 (a loose upper bound) versus 48 runs with DPOR
• p = 5 : we get 305,540,235,000 versus 2,375 runs with DPOR
• DPOR really works well in reducing the number of interleavings!
• Testing would have to exhibit its cleverness among 3 × 10^11 interleavings
50
On the HUGE importance of DPOR [NEW SLIDE]

BEFORE INSTRUMENTATION:

void * thread_A(void* arg) {
  pthread_mutex_lock(&mutex);
  A_count++;
  pthread_mutex_unlock(&mutex);
}

void * thread_B(void * arg) {
  pthread_mutex_lock(&lock);
  B_count++;
  pthread_mutex_unlock(&lock);
}

AFTER INSTRUMENTATION (transitions are shown as bands):

void *thread_A(void *arg )   // thread_B is similar
{
  void *__retres2 ;
  int __cil_tmp3 ;
  int __cil_tmp4 ;

  inspect_thread_start("thread_A");
  inspect_mutex_lock(& mutex);
  __cil_tmp4 = read_shared_0(& A_count);
  __cil_tmp3 = __cil_tmp4 + 1;
  write_shared_1(& A_count, __cil_tmp3);
  inspect_mutex_unlock(& mutex);
  __retres2 = (void *)0;
  inspect_thread_end();
  return (__retres2);
}
• ONE interleaving with DPOR
• 252 = 10! / (5!)^2 without DPOR
52
Obtaining and Running Inspect (Linux)

 http://www.cs.utah.edu/~yuyang/inspect
 You may need to obtain libssl-dev
 Needs Ocaml-3.10.2 or higher
 Remove the contents of the cache directory autom4te.cache in case “make” loops

 bin/instrument file.c
 bin/compile file.instr.c
 inspect –help
 inspect target
 inspect –s target
53
Examples Included With Tutorial

Dining3Buggy.c : Initial attempt to write 3 Dining Philosophers. Since the code is symmetric, it has a deadlock. Testing misses it.
Dining3BuggyRace1.c: Initial attempt to tweak the code results in read / write race which Inspect finds (testing misses race + deadlock)
Dining3BuggyRace2.c: Another race is now exposed by Inspect
Dining3BuggyNoRace.c: All races removed. Now testing sometimes finds the deadlock. Inspect always finds it.
Dining3.c: This is the final bug-fixed version.
Dining5.c: Without DPOR, this should generate too many states. With DPOR, the number of states / transitions is far fewer.
sharedArrayRace.c: A shared array program with a race.
sharedArray.c: After fixing the race, stateless search does not finish. We need stateful search to finish.
54
Why DPOR? “Classic POR” often runs into trouble because dependencies are often known only at runtime…
Thread 1:                 Thread 2:
if odd(a) then a++        if odd(a) then a++
b++                       b++

#define a A->fld
#define b B->fld
// A == B could be true or false…
55
DPOR helps enumerate all possible “happens-before” partial orders…
void * thread_A(void* arg) {
  pthread_mutex_lock(&mutex);
  A_count++;
  pthread_mutex_unlock(&mutex);
}

void * thread_B(void * arg) {
  pthread_mutex_lock(&lock);
  B_count++;
  pthread_mutex_unlock(&lock);
}

void * thread_C(void * arg) {
  pthread_mutex_lock(&mutex);
  A_count--;
  pthread_mutex_unlock(&mutex);
}

CONSIDER AN EXECUTION:

pthread_mutex_lock(&mutex); A_count++; pthread_mutex_unlock(&mutex);
pthread_mutex_lock(&lock);  B_count++; pthread_mutex_unlock(&lock);
pthread_mutex_lock(&mutex); A_count--; pthread_mutex_unlock(&mutex);
This partial order (“happens before”) determines the outcome of verification!
58
Happens-Before is defined by the Transition Dependency Relation
Two transitions t1 and t2 of a concurrent program are dependent if
– t1 and t2 belong to the same process, OR
– t1 and t2 are concurrently enabled, and t1, t2 are:
  » lock acquire operations on the same lock
  » operations on the same global object, at least one of them a write
  » a WAIT and a SIGNAL on the same condition variable
Introduce an HB edge between every pair of dependent operations in an execution.
59
DPOR helps enumerate all possible “happens-before” partial orders…
First HAPPENS-BEFORE:

pthread_mutex_lock(&mutex); A_count++; pthread_mutex_unlock(&mutex);
pthread_mutex_lock(&lock);  B_count++; pthread_mutex_unlock(&lock);
pthread_mutex_lock(&mutex); A_count--; pthread_mutex_unlock(&mutex);

Another HAPPENS-BEFORE:

pthread_mutex_lock(&mutex); A_count--; pthread_mutex_unlock(&mutex);
pthread_mutex_lock(&lock);  B_count++; pthread_mutex_unlock(&lock);
pthread_mutex_lock(&mutex); A_count++; pthread_mutex_unlock(&mutex);
60
Other details of DPOR
 Happens-before maintained using Vector Clocks
 Two transitions are concurrent if:
 – They are not Happens-Before ordered
 – They can be executed under disjoint lock-sets
 DATA RACE:
 – Two concurrent transitions enabled out of a state
 – Both access the same variable, and one is a write
63
Computation of “ample” sets in Static POR versus in DPOR

Exploring “ample” sets at every state suffices to generate all HB executions.

CLASSICAL POR: the ample set of a state S is determined (conservatively) when S is reached.

DPOR: a dependency observed later in the run helps EXTEND the ample (backtrack) set of the earlier state S!
64
Computation of “ample” sets in Static POR versus in DPOR
Ample determined using “local” criteria:
– Current state
– Next move of the Red process
– Nearest dependent transition looking back
– Add the Red process to the “Backtrack Set” there

This builds the ample set incrementally, based on observed dependencies.
Blue is in the “Done” set.

{ BT }, { Done }
65
Putting it all together …

 We target C/C++ PThread programs
 Instrument the given program (largely automated)
 Run the concurrent program “till the end”
 Compute dependencies based on concrete run information present in the runtime stack
 – This populates the Backtrack Sets: points at which the execution must be replayed
 When an item (a process ID) is explored from the Backtrack Set, put it in the “done” set
 Repeat till all the Backtrack Sets are empty
66
A Simple DPOR Example

Three threads contend for one lock:

t0: lock(t); unlock(t)        t1: lock(t); unlock(t)        t2: lock(t); unlock(t)

Each state on the search stack carries its { BT }, { Done } (backtrack and done) sets.

Run 1: t0: lock, t0: unlock, t1: lock. Since t1’s lock is dependent with t0’s lock, t1 enters the backtrack set of the initial state, whose sets become {t1}, {t0}. The run continues with t1: unlock, t2: lock; t2’s lock is dependent with t1’s, so the state from which t1 moved gets {t2}, {t1}. The run finishes with t2: unlock.

Backtracking pops to the deepest state with a pending backtrack set, replays the prefix t0: lock, t0: unlock, and now schedules t2: lock, t2: unlock first. As further dependencies are observed, the initial state’s sets grow to {t1,t2}, {t0}, and the state after t0’s unlock ends with {}, {t1, t2} once both continuations are done.

Finally the search backtracks to the initial state itself, exploring t1 first (sets {t2}, {t0, t1}) and then t2 first, until every backtrack set is empty.
81
Sequential Model Checking Times
Benchmark   Threads   Runs        Time (sec)
fsbench     26        8,192       291.32
indexer     16        32,768      1188.73
aget        6         113,400     5662.96
bbuf        8         1,938,816   39710.43
82
We have devised a work-distribution scheme (SPIN 2007)

[Figure: a load balancer coordinates worker a and worker b; messages exchanged: idle node id, request unloading, work description, report result]
83
Speedup on aget
84
Speedup on bbuf
85
Stateful Runtime Model Checking (to appear in SPIN 2008)
Method to remember search history (an approximate notion of ‘visited states’ maintained)
Avoiding unsoundness due to cutting off search
Avoiding unsoundness through efficient implementation
86
Recording visited states is hard

 Capturing the stacks of threads and the heap at runtime is difficult.
 – All the state elements shown in this figure must be recorded!
 – A problem not faced by JPF and other “bytecode”-based verification tools
 – Native-code verification tools such as Inspect are inherently harder to build
 Canonizing the heap and comparing pointers among executions are not straightforward.
87
Key observation
The changes between successive local states are often easy to capture.
– It is common that the local state of a thread does not change between successive visible operations: δ-epsilon (no change)
– It is common that a local state change involves only a small number of variables: δ-other (known change)

We can detect visited states across executions by tracking the changes of local states.

Also, Inspect’s replay-based execution avoids the need to capture states such as the process control block:
– Each time, Inspect recreates these states through replay
– Added bonus: ease of parallelization
88
Detecting visited states
S1 (g1, [L1, M1])
  thread 1: δ1  →  S2  (g2,  [L2, M1])
  thread 2: δ2  →  S2′ (g2′, [L1, M2])
From S2,  thread 2: δ2  →  S3 (g3, [L2, M2])
From S2′, thread 1: δ1  →  S3 (g3, [L2, M2])   (a visited state!)

Key idea:
• Local state changes are classified into
  – δ-epsilon (no change)
  – δ-bottom (unknown change)
  – δ-other (known change)
• Uniquely name each non-δ-bottom sequence of each thread
• May miss detecting revisits if δ1 ∘ δ2 = δ2 ∘ δ1
• Cheaply maintains local state info, and often detects revisits

IDs of local states are held in thread-local hash tables.
89
Detecting visited states
S1S1 (g1, [L1,M1])(g1, [L1,M1]) thread 1:
….
thread 1:
….
thread 2:
….
thread 2:
….
90
Detecting visited states
S1S1 (g1, [L1,M1])(g1, [L1,M1]) thread 1:
….
thread 1:
….
thread 2:
….
thread 2:
….
91
Detecting visited states
S1S1
thread 1: δ 1thread 1: δ 1
S2S2
(g1, [L1,M1])(g1, [L1,M1])
(g2, [L2,M1])(g2, [L2,M1])
thread 1:
L1+δ1 L2
…
thread 1:
L1+δ1 L2
…
thread 2:
…
thread 2:
…
92
Detecting visited states
S1 (g1, [L1,M1]) –thread 1: δ1→ S2 (g2, [L2,M1]) –thread 2: δ2→ S3 (g3, [L2,M2])
thread 1: L1 +δ1→ L2, …    thread 2: M1 +δ2→ M2, …
95
Detecting visited states
S1 (g1, [L1,M1]) –thread 1: δ1→ S2 (g2, [L2,M1]) –thread 2: δ2→ S3 (g3, [L2,M2])
S1 –thread 2: δ2→ S2′ (g2′, [L1,M2])
thread 1: L1 +δ1→ L2, …    thread 2: M1 +δ2→ M2, …
96
Detecting visited states
S1 (g1, [L1,M1]) –thread 1: δ1→ S2 (g2, [L2,M1]) –thread 2: δ2→ S3 (g3, [L2,M2])
S1 –thread 2: δ2→ S2′ (g2′, [L1,M2]) –thread 1: δ1→ S3 (g3, [L2,M2]): a visited state!
thread 1: L1 +δ1→ L2, …    thread 2: M1 +δ2→ M2, …
97
Making Stateful DPOR Work Soundly and Efficiently
visited state
Stop here and backtrack?
98
Naïve backtracking does not work!
dependent
visited state
This part will not be traversed if we backtrack naively!
99
A quick fix on this problem
visited state
When a visited state is found, for states in the search stack, add all enabled transitions into the backtrack set
Problem with this fix – redundant backtrack points
100
Our solution
Observation: the number of visible operations that threads can execute is usually small!
Solution: compute a summary of the sub-space using a transition dependency graph
101
Our solution: maintain a visible-operation dependency graph, and fill the backtrack set only according to it…
visited state
Visible operation dependency graph
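A minimal sketch of this idea (our phrasing; names are illustrative, not Inspect's API): when the search hits a visited state, consult a summary of the visible operations reachable below it, and add a backtrack point only at stack entries whose transition is dependent on some operation in that summary.

```python
def backtrack_points(stack_ops, summary_ops, dependent):
    """Indices of search-stack entries that need a backtrack point.

    stack_ops:   visible operations on the current search stack, in order.
    summary_ops: visible operations recorded below the visited state.
    dependent:   predicate over two operations.
    """
    return [i for i, op in enumerate(stack_ops)
            if any(dependent(op, s) for s in summary_ops)]

# Lock operations on the same object are dependent; otherwise independent.
dep = lambda a, b: a[0] == "lock" and b[0] == "lock" and a[1] == b[1]
stack = [("lock", "p"), ("lock", "q"), ("unlock", "p")]
summary = {("lock", "q")}
assert backtrack_points(stack, summary, dep) == [1]   # only the lock on q
```

Compared with the naive fix above, only one backtrack point is added instead of one per enabled transition at every stack state.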
102
Evaluation: two (realistic) benchmarks
– pfscan -- a parallel file scanner
– bzip2smp – a parallel file compressor
103
Evaluation
benchmark   threads   DPOR (runs / transitions / time)     SDPOR (runs / transitions / time)
bzip2smp    4         - / - / -                            4,598 / 26,442 / 1311.15
bzip2smp    5         - / - / -                            18,709 / 92,276 / 9546.34
bzip2smp    6         - / - / -                            51,400 / 236,863 / 25659.4
pfscan      3         84 / 1,157 / 0.53                    71 / 967 / 0.49
pfscan      4         13,617 / 189,218 / 240.74            3,168 / 40,395 / 57.43
pfscan      5         - / - / -                            272,873 / 3,402,486 / 5328.84
104
RESULTS pertaining to ISP (to be presented at CAV 2008)
105
ISP is an entirely separate project… http://www.cs.utah.edu/formal_verification/ISP_Tests
(Figure: operation of ISP)
MPI Program → simplifications → Simplified MPI Program → compile → executable
The executable runs as Proc 1 … Proc n against the actual MPI library and runtime
A scheduler intercepts PMPI calls and controls each process via request/permit
106
MPI Program Verification Work prior to ISP
Siegel and Avrunin, Siegel – MPI programs modeled in Promela
» Models built by hand
– MPI-SPIN employs some “MPI-aware” reductions
» A version of SPIN with C functions serving to capture MPI
– Symbolic execution to compare sequential and concurrent algorithms
» Three precision levels of comparison
Efforts based on static analysis
– Vuduc, Quinlan, de Supinski, Dwyer, Hoveland, …
» The usual attributes of static analysis apply
Dynamic execution based verification is ESSENTIAL for MPI
– Need to examine code paths in MPI and user-level libraries
– Bugs may be in the “surrounding code”
– Dynamic reductions can dramatically reduce the number of interleavings
– MPI programs often compute many things (communicators, send targets, …)
107
Summary of ISP
Dynamic verification of MPI programs suffers from
– the inability to externally control how the MPI runtime performs message matches for wildcard receive statements
– the inability to force a desired alternative execution
ISP overcomes these problems by
– Exploiting MPI’s out-of-order semantics
» If the runtime does not have to follow certain program orderings, then one can afford to postpone the issue of certain MPI operations
» Such postponement allows one to determine the maximal set of message sends that can ever match a receive
» Later, we show that this is also the basis of a barrier-removal algorithm
– An execution strategy that guarantees AMPLE sets at every point
Dealing with actual concurrent-program runtimes in a DPOR approach is a growing reality to be confronted
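The postponement idea can be sketched in a few lines (our simplification, not ISP's code): a wildcard receive is not handed to the runtime until every process is at a fence, at which point the sends collected so far form the maximal set of possible matches, each of which is then explored.

```python
def maximal_matches(pending_sends, recv_src):
    """Sends that can match a postponed receive once all processes are fenced.

    pending_sends: list of (source rank, payload) already issued toward us.
    recv_src:      a concrete rank, or "*" for MPI_ANY_SOURCE.
    """
    if recv_src == "*":
        return list(pending_sends)            # every pending send can match
    return [s for s in pending_sends if s[0] == recv_src]

sends = [(0, 22), (2, 33)]
assert maximal_matches(sends, "*") == [(0, 22), (2, 33)]   # two cases to explore
assert maximal_matches(sends, 2) == [(2, 33)]              # deterministic match
```

Issuing the receive eagerly would let the runtime commit to one match and silently hide the other.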
108
ISP Results
Found deadlocks missed by some other existing tools
– testing tools such as Marmot, and MPICH run from the terminal
Could finish examining some large benchmarks
– Game of Life example (500+ lines of code)
– “Lines of code” is an unfamiliar metric for MPI
» even 4 lines of code can be “hard”
This level of coverage is unattainable through “testing” in the presence of
– non-determinism
» it is never known statically whether code is deterministic
– too many processes
» blind interleaving can kill any testing strategy
Full table of results presented at http://www.cs.utah.edu/formal_verification/ISP_Tests
109
How testing can miss error1
P0---
MPI_Send(to P1…);
MPI_Send(to P1, data=22);
P1---
MPI_Recv(from P0…);
MPI_Recv(from P2…);
MPI_Recv(*, x);
if (x==22) then error1 else MPI_Recv(*, x);
P2---
MPI_Send(to P1…);
MPI_Send(to P1, data=33);
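The example can be compressed into a few lines (our simplification of the program above): after P1's two specific receives consume the first sends from P0 and P2, the wildcard MPI_Recv(*, x) has two pending matches, data=22 from P0 and data=33 from P2. A native test run commits to whichever the runtime happens to pick; enumerating both choices is what exposes error1 deterministically.

```python
def p1_third_recv(match):
    """P1's branch after the wildcard receive binds x."""
    source, x = match
    return "error1" if x == 22 else "ok"

pending = [(0, 22), (2, 33)]     # (sender rank, payload) still in flight to P1
outcomes = [p1_third_recv(m) for m in pending]
assert sorted(outcomes) == ["error1", "ok"]   # one schedule hides the bug
```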
114
How testing can miss error1 (5)
P0---
MPI_Send(to P1…);
MPI_Send(to P1, data=22);
P1---
MPI_Recv(from P0…);
MPI_Recv(from P2…);
MPI_Recv(*, x);
if (x==22) then error1 else MPI_Recv(*, x);
P2---
MPI_Send(to P1…);
MPI_Send(to P1, data=33);
unlucky
115
How testing can miss error1 (6)
P0---
MPI_Send(to P1…);
MPI_Send(to P1, data=22);
P1---
MPI_Recv(from P0…);
MPI_Recv(from P2…);
MPI_Recv(*, x);
if (x==22) then error1 else MPI_Recv(*, x);
P2---
MPI_Send(to P1…);
MPI_Send(to P1, data=33);
lucky
116
How ISP efficiently catches error1
P0---
MPI_Send(to P1…);
MPI_Send(to P1, data=22);
P1---
MPI_Recv(from P0…);
MPI_Recv(from P2…);
MPI_Recv(*, x);
if (x==22) then error1 else MPI_Recv(*, x);
P2---
MPI_Send(to P1…);
MPI_Send(to P1, data=33);
Avoid unnecessary interleavings here, thanks to POR
117
How ISP efficiently catches error1
P0---
MPI_Send(to P1…);
MPI_Send(to P1, data=22);
P1---
MPI_Recv(from P0…);
MPI_Recv(from P2…);
MPI_Recv(*, x);
if (x==22) then error1 else MPI_Recv(*, x);
P2---
MPI_Send(to P1…);
MPI_Send(to P1, data=33);
Avoid unnecessary interleavings here, thanks to POR
Consider both matches here, thanks to dynamic rewrite and recursive expansion
118
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(*, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
121
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(*, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
Is a fence, hence switch over to P1…
123
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(*, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
Is a fence, hence switch over to P2…
124
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(*, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
Likewise, do it for P2 (multiple steps shown here)…
125
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(*, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
All processes have hit a fence
Now form Match Sets - you’ll see which are the match sets by those actions turning red!
Issue Match Sets in priority order
126
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(*, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
First priority match-sets are Barriers
Their ancestors have fired (they have no ancestors!)
SO WE CAN LET THEM FIRE
127
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(*, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
Now the processes are no longer at a fence, so the execution advances each process to its next fence – multiple such steps are shown here…
128
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(*, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
Now, the only eligible ops whose ancestors have fired are shown in blue
129
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(*, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
They also happen to be the ones needing a dynamic rewrite into specific receives… we will show TWO cases now and then pursue one of the cases.
130
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(from 1, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
Case 1
131
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(from 2, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
Case 2
132
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(from 1, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
Pursuing Case 1, we get this…
133
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(from 1, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
And then this…
136
Illustration of the POE Algorithm (conventions: encountered, rewritten, fired)
P0---
MPI_Irecv(from 1, &req);
MPI_Barrier();
MPI_Wait(&req);
MPI_Recv(from 2);
P1---
MPI_Barrier();
MPI_Isend(to 0, &req);
MPI_Wait(&req);
P2---
MPI_Isend(to 0, &req);
MPI_Barrier();
MPI_Wait(&req);
And finally this!
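The POE loop illustrated above can be compressed into a sketch (ours, not ISP's code): run each process to its next fence, then form match sets – all barriers fire first, and a postponed wildcard receive yields one match set per eligible sender, each replayed in turn.

```python
def match_sets(at_fence_ops):
    """Match sets from the operations collected once all processes are fenced.

    at_fence_ops: list of (rank, kind) with kind in
    {"barrier", "irecv*", "isend"}; names are our own shorthand.
    """
    barriers = [op for op in at_fence_ops if op[1] == "barrier"]
    if barriers:                               # barriers have highest priority
        return [barriers]
    recvs = [op for op in at_fence_ops if op[1] == "irecv*"]
    sends = [op for op in at_fence_ops if op[1] == "isend"]
    # dynamic rewrite: one match set per sender the wildcard could take
    return [[r, s] for r in recvs for s in sends]

ops = [(0, "irecv*"), (1, "isend"), (2, "isend")]
assert match_sets(ops) == [[(0, "irecv*"), (1, "isend")],
                           [(0, "irecv*"), (2, "isend")]]     # Cases 1 and 2
assert match_sets([(0, "barrier"), (1, "barrier")]) == \
    [[(0, "barrier"), (1, "barrier")]]
```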
137
RESULTS pertaining to FIB
138
Summary of FIB
MPI_Barrier() calls within an MPI program are employed to constrain executions
– there are always more executions without a barrier than with one
Barrier uses and desired checks
– Used to streamline I/O
» the barrier had better be FUNCTIONALLY IRRELEVANT
– To prevent certain message matches from occurring
» such barriers had better be FUNCTIONALLY RELEVANT
Hitherto there was no algorithm to verify (for all program inputs) whether a barrier is an FIB or an FRB
We offer an algorithm (called “Fib”) that finds all FIBs for a given input
Through static analysis, we can sometimes extrapolate this result to cover ALL possible inputs
– for MPI programs that do not have “data dependent control flows”
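The relevance criterion developed below can be sketched as a graph check (our rendering of the idea; node names are illustrative): take a wildcard receive R and a potential sender S that did not match it, and ask whether R and S are ordered in the IntraCB/InterCB graph only through a barrier. If every ordering path goes through a barrier, that barrier is functionally relevant – removing it would enable the alternative match.

```python
def path_avoiding_barriers(graph, src, dst):
    """Is dst reachable from src without traversing a barrier node?"""
    stack, seen = [src], set()
    while stack:
        n = stack.pop()
        if n == dst:
            return True
        if n in seen or "Barrier" in n:   # barrier nodes block the path
            continue
        seen.add(n)
        stack.extend(graph.get(n, ()))
    return False

def barrier_is_relevant(graph, recv, send):
    # ordered, but only via barrier nodes => the barrier does the ordering
    return not path_avoiding_barriers(graph, recv, send)

# Running example: P0's Irecv reaches P2's Isend only through the barriers.
g = {"P0:Irecv": ["P0:Wait"], "P0:Wait": ["P0:Barrier"],
     "P0:Barrier": ["P2:Isend"]}
assert barrier_is_relevant(g, "P0:Irecv", "P2:Isend")
```

As the slides stress, to flag a barrier irrelevant (FIB) this must hold over all POE-reduced interleavings, not just one.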
139
Fib Overview – is this barrier relevant?
P0---
MPI_Irecv(*, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
140
IntraCB Edges (how much program order is maintained in executions)
P0---
MPI_Irecv(*, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
141
IntraCB (implied transitivity)
P0---
MPI_Irecv(*, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
142
InterCB introduction: for any x,y in a match set, add InterCB from x to every IntraCB successor of y
P0---
MPI_Irecv(*, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
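The InterCB rule stated in the title is mechanical enough to sketch directly (our code, with names from the running example treated as plain strings):

```python
def add_inter_cb(match_set, intra_succ):
    """InterCB edges induced by one match set.

    For any two operations x, y in the match set, add an edge from x to
    every IntraCB (program-order) successor of y. intra_succ maps an
    operation name to the list of its IntraCB successors.
    """
    edges = set()
    for x in match_set:
        for y in match_set:
            if x == y:
                continue
            for z in intra_succ.get(y, ()):
                edges.add((x, z))
    return edges

# Match set {P0:Irecv, P1:Isend}; P0's Wait follows its Irecv, and P1's
# Barrier follows its Isend.
succ = {"P0:Irecv": ["P0:Wait"], "P1:Isend": ["P1:Barrier"]}
edges = add_inter_cb(["P0:Irecv", "P1:Isend"], succ)
assert ("P0:Irecv", "P1:Barrier") in edges
assert ("P1:Isend", "P0:Wait") in edges
```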
143
InterCB introduction: for any x,y in a match set, add InterCB from x to every IntraCB successor of y
P0---
MPI_Irecv(from 1, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
Match set formedduring POE
144
InterCB introduction: for any x,y in a match set, add InterCB from x to every IntraCB successor of y
P0---
MPI_Irecv(from 1, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
Match set formedduring POE
InterCB
145
InterCB introduction: for any x,y in a match set, add InterCB from x to every IntraCB successor of y
P0---
MPI_Irecv(from 1, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
InterCB
146
InterCB introduction: for any x,y in a match set, add InterCB from x to every IntraCB successor of y
P0---
MPI_Irecv(from 1, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
InterCB
InterCB
147
Continue adding InterCBs as the execution advances. Here, we pick the Barriers to be the next match set…
P0---
MPI_Irecv(from 1, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
InterCB
InterCB
149
… newly added InterCBs (only some of them shown…)
P0---
MPI_Irecv(from 1, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
InterCB
InterCB
InterCB
InterCB
150
Now the question pertains to the receive that was a wild-card, and a potential sender that could have matched it…
P0---
MPI_Irecv(was *, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
InterCB
InterCB
InterCB
InterCB
151
If they are ordered by a Barrier and NO OTHER OPERATION, then the Barrier is RELEVANT…
P0---
MPI_Irecv(was *, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
InterCB
InterCB
InterCB
InterCB
155
In this example, the Barrier is relevant !!
P0---
MPI_Irecv(was *, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
InterCB
InterCB
InterCB
InterCB
156
To flag a barrier as FIB (irrelevant), it has to remain irrelevant over all POE-reduced interleavings!!
P0---
MPI_Irecv(was *, &req);
MPI_Wait(&req);
MPI_Barrier();
MPI_Finalize();
P1---
MPI_Isend(to 0, 33);
MPI_Barrier();
MPI_Finalize();
P2---
MPI_Barrier();
MPI_Isend(to P0, 22);
MPI_Finalize();
InterCB
InterCB
InterCB
InterCB
157
Summary of Fib
Algorithm has been implemented within the ISP tool
Very low overhead (so can keep it turned ‘on’ always…)
Identified FIBs and FRBs in many examples – manually checked correctness of identification
Simple static analysis facility has helped extend this claim over ALL external drivers
Future work : extend reach of this static analysis method to be able to claim FIB over all possible inputs
158
Concluding Remarks
We are “sold” on the merits of dynamic analysis
– Gives a sense of realism
– Can give designers “debugger-like” interfaces
» yet “verifier-like” coverages
Going Forward
– Bug-preserving scaling methods are essential to develop
» For MPI, OpenMP, …
– Collaboration between API designers and verification tool builders
» Make APIs easier to use in a “verification mode”
» Keep “verification mode” and “execution mode” semantics in agreement
» The plethora of concurrency APIs seems to require early attention to such a “verification mode API”
– Static and dynamic analysis can work in synergy
159
Extra slides
160
Looking Further Ahead: need to clear the “idea log-jam” in multi-core computing…
“There isn’t such a thing as Republican clean air or Democratic clean air. We all breathe the same air.”
There isn’t such a thing as an architecture-only solution, or a compilers-only solution, to future problems in multi-core computing…
161
Now you see it; now you don’t!
On the menace of non-reproducible bugs.
Deterministic replay must ideally be an option
User-programmable schedulers are greatly emphasized by expert developers
Runtime model-checking methods with state-space reduction hold promise in meshing with current practice…
162
Computing Ample Sets (basic idea)
Ideal situation:
– No path via the green triangle will ever “wake up” the disabled red transitions
» If this can be established, we can avoid interleaving the greens with the blues…
If dependence cannot be precisely computed, we will interleave the greens with the blues – too many such interleavings!
(Figure: the transitions going out of a state S, belonging to different processes, are divided into three groups – some arbitrary transition t and its dependency closure; the disabled dependents (wrt t); and the enabled independents (wrt t).)
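The grouping just described can be sketched as a fixpoint over the enabled transitions (a standard POR idea in our own phrasing; the dependence predicate here is a stand-in for a real one):

```python
def ample_candidate(enabled, t, dependent):
    """Dependency closure of transition t within the enabled transitions.

    The transitions in the returned set are the candidate ample set; the
    remaining enabled transitions are the independents that may be
    postponed, provided nothing reachable through them wakes up the
    disabled dependents.
    """
    amp = {t}
    changed = True
    while changed:
        changed = False
        for u in enabled:
            if u not in amp and any(dependent(u, v) for v in amp):
                amp.add(u)
                changed = True
    return amp

# Toy dependence: transitions touching the same object are dependent.
dep = lambda a, b: a[1] == b[1]
enabled = {("t", "x"), ("u", "x"), ("v", "y")}
assert ample_candidate(enabled, ("t", "x"), dep) == {("t", "x"), ("u", "x")}
```

When `dependent` is imprecise, the closure grows toward the full enabled set and the reduction degrades, which is the "too many interleavings" case noted above.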