1

Distributed Dynamic Partial Order Reduction based Verification of Threaded Software

Yu Yang (PhD student; summer intern at CBL)

Xiaofang Chen (PhD student; summer intern at IBM)

Ganesh Gopalakrishnan, Robert M. Kirby

School of Computing, University of Utah

SPIN 2007 Workshop Presentation

Supported by: Microsoft HPC Institutes

NSF CNS 0509379

2

Thread Programming will become more prevalent

FV of thread programs will grow in importance

3

Why FV for Threaded Programs

> 80% of chips shipped will be multi-core

(photo courtesy of Intel Corporation.)

4

Model Checking will Increasingly be through Dynamic Methods

Also known as Runtime or In-Situ methods

5

Why Dynamic Verification Methods

• Even after early life-cycle modeling and validation, the final code will have far more details

• Early life-cycle modeling is often impossible

- Use of libraries (APIs) such as MPI, OpenMP, Shmem, …

- Library function semantics can be tricky

- The bug may be in the library function implementation

6

Model Checking will often be “stateless”

7

Why Stateless

• One may not be able to access a lot of the state

- e.g. state of the OS

• It is expensive to hash states and look up revisits

• Stateless is easier to parallelize

8

Partial Order Reduction is Crucial !

9

Why POR?

Process P0:
0: MPI_Init
1: MPI_Win_lock
2: MPI_Accumulate
3: MPI_Win_unlock
4: MPI_Barrier
5: MPI_Finalize

Process P1:
0: MPI_Init
1: MPI_Win_lock
2: MPI_Accumulate
3: MPI_Win_unlock
4: MPI_Barrier
5: MPI_Finalize

ONLY DEPENDENT OPERATIONS

• 504 interleavings without POR: (2 * (10!)) / (5!)^2

• 2 interleavings with POR !!
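The arithmetic behind the 504 figure can be checked with a few lines of C. This is only a sketch: the function names are made up for illustration, and the leading factor of 2 is taken from the slide's formula as given, not derived here.

```c
#include <assert.h>
#include <stdint.h>

/* Number of ways to interleave two sequences of n atomic steps each:
 * the binomial coefficient C(2n, n) = (2n)! / (n!)^2.
 * Hypothetical helper, not part of any tool named in the talk. */
uint64_t interleavings_of_two(unsigned n)
{
    assert(n <= 30);            /* keep the result well inside 64 bits */
    uint64_t result = 1;
    /* Multiplicative formula: after iteration i, result == C(n + i, i).
     * The division is exact at every step. */
    for (unsigned i = 1; i <= n; i++) {
        result = result * (n + i) / i;
    }
    return result;
}

/* The slide's count without POR: 2 * 10! / (5!)^2. */
uint64_t slide_count_without_por(void)
{
    return 2 * interleavings_of_two(5);
}
```

With n = 5 the binomial term is 252, so the slide's formula evaluates to 504.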

10

Dynamic POR is almost a “must” !

( Dynamic POR as in Flanagan and Godefroid, POPL 2005)

11

Why Dynamic POR ?

a[ j ]++ a[ k ]--

• Ample Set depends on whether j == k

• Can be very difficult to determine statically

• Can determine dynamically
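A minimal sketch of the dynamic check, assuming the checker records the concrete address each operation touches at run time. The type and function names are hypothetical and are not taken from Inspect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one observed memory access. */
typedef struct {
    const int *addr;     /* concrete address touched during this run */
    bool       is_write;
} access_t;

/* Two accesses are dependent if they hit the same location and
 * at least one of them writes. */
bool accesses_dependent(access_t x, access_t y)
{
    return x.addr == y.addr && (x.is_write || y.is_write);
}

/* a[j]++ and a[k]-- both write, so they conflict exactly when j == k.
 * At run time j and k are known, so the answer is just a comparison. */
bool increments_conflict(const int *a, size_t j, size_t k)
{
    assert(a != NULL);
    access_t inc = { &a[j], true };
    access_t dec = { &a[k], true };
    return accesses_dependent(inc, dec);
}
```

A static analysis would have to prove something about every possible value of j and k; the dynamic check only looks at the values seen in the current execution.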

12

Why Dynamic POR ?

The notion of action dependence (crucial to POR methods) is a function of the execution

13

Computation of “ample” sets in Static POR versus in DPOR

Ample determined using “local” criteria

Current State

Next move of Red process

Nearest Dependent Transition Looking Back

Add Red Process to “Backtrack Set”

This builds the Ample set incrementally based on observed dependencies

Blue is in “Done” set

{ BT }, { Done }
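The "look back for the nearest dependent transition, then add the process to its backtrack set" step can be sketched as below. This is a simplification under stated assumptions: thread sets are bitmasks, only mutex operations are modeled, and two lock acquisitions of the same mutex by different threads are treated as the dependent pair. The happens-before and enabledness checks of the full DPOR algorithm are omitted, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_DEPTH 64

typedef enum { OP_LOCK, OP_UNLOCK } op_t;

/* One executed transition plus the { BT }, { Done } sets of the
 * state it was taken from. */
typedef struct {
    int      tid;        /* thread that executed the transition */
    op_t     op;
    int      object;     /* id of the mutex it touched */
    uint32_t backtrack;  /* threads still to be tried from that state */
    uint32_t done;       /* threads already tried from that state */
} step_t;

typedef struct {
    step_t steps[MAX_DEPTH];
    int    depth;
} trace_t;

/* Assumed dependence relation: lock vs. lock on the same mutex. */
static int dependent(const step_t *s, int tid, op_t op, int object)
{
    return s->tid != tid && s->object == object
        && s->op == OP_LOCK && op == OP_LOCK;
}

/* Scan the trace backwards for the nearest transition dependent with
 * the next move of thread tid. Add tid to that state's backtrack set
 * unless it is already done. Returns the step index, or -1 if none. */
int add_backtrack_point(trace_t *tr, int tid, op_t op, int object)
{
    assert(tr->depth >= 0 && tr->depth <= MAX_DEPTH);
    assert(tid >= 0 && tid < 32);
    for (int i = tr->depth - 1; i >= 0; i--) {
        step_t *s = &tr->steps[i];
        if (dependent(s, tid, op, object)) {
            if ((s->done & (1u << tid)) == 0)
                s->backtrack |= 1u << tid;
            return i;
        }
    }
    return -1;
}
```

For a trace "t0: lock, t0: unlock" followed by t1's pending lock, the call marks step 0, which corresponds to the `{t1}, {t0}` annotation in the simple DPOR example.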

14

Putting it all together …

• We target C/C++ PThread programs

• Instrument the given program (largely automated)

• Run the concurrent program “till the end”

• Record interleaving variants while advancing

• When the number of recorded backtrack points reaches a soft limit, spill work to other nodes

• In one larger example, an 11-hour run was finished in 11 minutes using 64 nodes

• Heuristic to avoid recomputations was essential for speed-up

• First known distributed DPOR

15

A Simple DPOR Example

{}, {}

t0:

lock(t)

unlock(t)

t1:

lock(t)

unlock(t)

t2:

lock(t)

unlock(t)

16

t0: lock

{}, {}

t0:

lock(t)

unlock(t)

t1:

lock(t)

unlock(t)

t2:

lock(t)

unlock(t)


17

t0: lock

t0: unlock

{}, {}

t0:

lock(t)

unlock(t)

t1:

lock(t)

unlock(t)

t2:

lock(t)

unlock(t)


18

t0: lock

t0: unlock

t1: lock

{}, {}

t0:

lock(t)

unlock(t)

t1:

lock(t)

unlock(t)

t2:

lock(t)

unlock(t)


19

t0: lock

t0: unlock

t1: lock

{t1}, {t0}

t0:

lock(t)

unlock(t)

t1:

lock(t)

unlock(t)

t2:

lock(t)

unlock(t)


20

t0: lock

t0: unlock

t1: lock

t1: unlock

t2: lock

{t1}, {t0}

{}, {}

t0:

lock(t)

unlock(t)

t1:

lock(t)

unlock(t)

t2:

lock(t)

unlock(t)


21

t0: lock

t0: unlock

t1: lock

t1: unlock

t2: lock

{t1}, {t0}

{t2}, {t1}

t0:

lock(t)

unlock(t)

t1:

lock(t)

unlock(t)

t2:

lock(t)

unlock(t)


22

t0: lock

t0: unlock

t1: lock

t1: unlock

t2: lock

t2: unlock

{t1}, {t0}

{t2}, {t1}

t0:

lock(t)

unlock(t)

t1:

lock(t)

unlock(t)

t2:

lock(t)

unlock(t)


23

t0: lock

t0: unlock

t1: lock

t1: unlock

t2: lock

{t1}, {t0}

{t2}, {t1}

t0:

lock(t)

unlock(t)

t1:

lock(t)

unlock(t)

t2:

lock(t)

unlock(t)


24

t0: lock

t0: unlock

{t1}, {t0}

{t2}, {t1}

t0:

lock(t)

unlock(t)

t1:

lock(t)

unlock(t)

t2:

lock(t)

unlock(t)


25

t0: lock

t0: unlock

t2: lock

{t1,t2}, {t0}

{}, {t1, t2}

t0:

lock(t)

unlock(t)

t1:

lock(t)

unlock(t)

t2:

lock(t)

unlock(t)


26

t0: lock

t0: unlock

t2: lock

t2: unlock

{t1,t2}, {t0}

{}, {t1, t2}

t0:

lock(t)

unlock(t)

t1:

lock(t)

unlock(t)

t2:

lock(t)

unlock(t)


…

27

t0: lock

t0: unlock

{t1,t2}, {t0}

{}, {t1, t2}

t0:

lock(t)

unlock(t)

t1:

lock(t)

unlock(t)

t2:

lock(t)

unlock(t)


28

{t2}, {t0,t1}

t0:

lock(t)

unlock(t)

t1:

lock(t)

unlock(t)

t2:

lock(t)

unlock(t)


29

t1: lock

t1: unlock

{t2}, {t0, t1}

t0:

lock(t)

unlock(t)

t1:

lock(t)

unlock(t)

t2:

lock(t)

unlock(t)


…

30

For this example, all the paths explored during DPOR

For others, it will be a proper subset
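For the three-thread lock example, the full set of paths is small enough to count directly. The sketch below is a hypothetical stateless enumerator, not the DPOR algorithm itself: it assumes n threads that each run lock(t); unlock(t) on one shared mutex, so that while one thread holds the mutex no other thread can move and each lock/unlock pair behaves as a single step.

```c
#include <assert.h>

/* Depth-first enumeration without storing visited states.
 * remaining_mask has bit i set if thread i has not yet run. */
static unsigned long explore(unsigned remaining_mask, int n)
{
    if (remaining_mask == 0)
        return 1;                       /* one complete execution */
    unsigned long paths = 0;
    for (int tid = 0; tid < n; tid++) {
        if (remaining_mask & (1u << tid)) {
            /* tid acquires and releases the mutex; every other thread
             * is blocked on lock(t) in between. */
            paths += explore(remaining_mask & ~(1u << tid), n);
        }
    }
    return paths;
}

/* Number of distinct lock orders for n such threads (n factorial). */
unsigned long count_lock_orders(int n)
{
    assert(n >= 0 && n <= 12);          /* keep the recursion small */
    return explore((1u << n) - 1u, n);
}
```

Under this model three threads give 3! = 6 executions, and because every pair of lock operations is dependent, none of them can be pruned.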

31

Idea for parallelization: Explore computations from the backtrack set in other processes.

“Embarrassingly Parallel” – it seems so, anyway !

32

We first built a sequential DPOR explorer for C / Pthreads programs, called “Inspect”

Multithreaded C/C++ program → instrumentation → instrumented program + thread library wrapper → compile → executable

At run time, threads 1 … n of the executable exchange request/permit messages with the scheduler.

33

We then made the following observations

• Stateless search does not maintain search history

• Different branches of an acyclic space can be explored concurrently

• Simple master-slave scheme can work here

– one load balancer + workers

34

We then devised a work-distribution scheme…

Messages between the workers (worker a, worker b) and the load balancer: request unloading, idle node id, work description, report result

35

We got zero speedup! Why?

Deeper investigation revealed that multiple nodes ended up exploring the same interleavings

36

Illustration of the problem (1 of 5)

t0: lock

t0: unlock

t1: lock

t1: unlock

t2: lock

t2: unlock

{t1}, {t0}

{t2}, {t1}

37


t0: lock

t0: unlock

t1: lock

t1: unlock

t2: lock

t2: unlock

{t1}, {t0}

{t2}, {t1}

Heuristic : Handoff DEEPEST backtrack point for another node to explore

Reason : Largest number of paths emanate from there

To Node 1

38

Detail of (2 of 5)

t0: lock

t0: unlock

t1: lock

t1: unlock

t2: lock

t2: unlock

{t1}, {t0}

{t2}, {t1}

Node 0

t0: lock

t0: unlock

t1: lock

t1: unlock

t2: lock

t2: unlock

{ }, {t0,t1}

{t2}, {t1}

39

Detail of (2 of 5)

t0: lock

t0: unlock

t1: lock

t1: unlock

t2: lock

t2: unlock

{t1}, {t0}

{t2}, {t1}

Node 1

Node 0

t0: lock

t0: unlock

t1: lock

t1: unlock

t2: lock

t2: unlock

{ }, {t0,t1}

{t2}, {t1}

t0: lock

{t1}, {t0}

40

Detail of (2 of 5)

t0: lock

t0: unlock

t1: lock

t1: unlock

t2: lock

t2: unlock

{t1}, {t0}

{t2}, {t1}

Node 1

Node 0

t0: lock

t0: unlock

t1: lock

t1: unlock

t2: lock

t2: unlock

{ }, { t0,t1 }

{t2}, {t1}

t0: lock

{ t1 }, {t0}

t1 is forced into the DONE set before work is handed to Node 1

Node 1 keeps t1 in backtrack set
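The handoff bookkeeping on this slide can be sketched as follows, assuming the backtrack and done sets are stored as thread bitmasks. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The { BT }, { Done } pair attached to one state on the search stack. */
typedef struct {
    uint32_t backtrack;
    uint32_t done;
} sets_t;

/* The donor node gives away the exploration of thread tid from this
 * state. The receiver gets the sets as they were (tid still in the
 * backtrack set); the donor moves tid into its own done set so it
 * will not explore that branch itself. */
sets_t hand_off(sets_t *donor, int tid)
{
    assert(tid >= 0 && tid < 32);
    uint32_t bit = 1u << tid;
    assert(donor->backtrack & bit);   /* can only hand off pending work */
    sets_t for_receiver = *donor;
    donor->backtrack &= ~bit;
    donor->done      |= bit;
    return for_receiver;
}
```

Starting from `{t1}, {t0}`, the donor ends with `{ }, {t0,t1}` and the receiver starts with `{t1}, {t0}`, matching the Node 0 and Node 1 annotations above.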

41


t0: lock

t0: unlock

t1: lock

t1: unlock

t2: lock

t2: unlock

{t1}, {t0}

{t2}, {t1}

To Node 1

Decide to do THIS work at Node 0 itself…

42

t0: lock

t0: unlock

{}, {t0,t1}

{t2}, {t1}

{t1}, {t0}


Being expanded by Node 0

Being expanded by Node 1

43


t0: lock

t0: unlock

{t2}, {t0,t1}

{}, {t2}

t2: lock

t2: unlock

44


t0: lock

t0: unlock

{t2}, {t0,t1}

{}, {t2}

{t1}, {t0}

t1: lock

t1: unlock

t2: lock


45


t0: lock

t0: unlock

{t2}, {t0,t1}

{}, {t2}

{t2}, {t0, t1}

t1: lock

t1: unlock

t2: lock


t2: lock


{}, {t2}

Redundancy!

46

New Backtrack Set Computation: Aggressively mark up the stack!

t0: lock

t0: unlock

t1: lock

t1: unlock

t2: lock

t2: unlock

{t1,t2}, {t0}

{t2}, {t1}

Update the backtrack sets of ALL dependent operations!

• Forms a good allocation scheme

• Does not involve any synchronizations

• Redundant work may still be performed

• Likelihood is reduced because a node aggressively “owns” one operation and all its dependents
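A sketch of the aggressive update, under the same assumptions as the usual bitmask model: only mutex operations, and lock vs. lock on the same mutex as the dependence relation. The difference from the standard update is that the scan does not stop at the nearest dependent transition. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_DEPTH 64

typedef enum { OP_LOCK, OP_UNLOCK } op_t;

typedef struct {
    int      tid;        /* thread that executed the transition */
    op_t     op;
    int      object;     /* id of the mutex it touched */
    uint32_t backtrack;  /* threads still to be tried from that state */
    uint32_t done;       /* threads already tried from that state */
} step_t;

typedef struct {
    step_t steps[MAX_DEPTH];
    int    depth;
} trace_t;

/* Add tid to the backtrack set of EVERY earlier state whose transition
 * is dependent with tid's next move, not just the nearest one.
 * Returns how many backtrack sets gained tid. */
int mark_all_dependent(trace_t *tr, int tid, op_t op, int object)
{
    assert(tr->depth >= 0 && tr->depth <= MAX_DEPTH);
    assert(tid >= 0 && tid < 32);
    int marked = 0;
    for (int i = tr->depth - 1; i >= 0; i--) {
        step_t *s = &tr->steps[i];
        int dep = s->tid != tid && s->object == object
               && s->op == OP_LOCK && op == OP_LOCK;
        if (dep && (s->done & (1u << tid)) == 0
                && (s->backtrack & (1u << tid)) == 0) {
            s->backtrack |= 1u << tid;
            marked++;
        }
    }
    return marked;
}
```

With the trace "t0: lock, t0: unlock, t1: lock, t1: unlock" and t2's pending lock, both lock steps are marked, giving `{t1,t2}, {t0}` and `{t2}, {t1}` as on the slide.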

47

Implementation and Evaluation

• Using MPI for communication among nodes

• Did experiments on a 72-node cluster

– 2.4 GHz Intel Xeon processors, 2 GB memory/node

– Two (small) benchmarks: the indexer and file system benchmarks used in Flanagan and Godefroid’s DPOR paper

– Aget: a multithreaded ftp client

– Bbuf: an implementation of a bounded buffer

48

Sequential Checking Time

Benchmark   Threads   Runs        Time (sec)
fsbench     26        8,192       291.32
indexer     16        32,768      1188.73
aget        6         113,400     5662.96
bbuf        8         1,938,816   39710.43

49

Speedup on indexer and fsbench (small examples, so diminishing returns beyond 40 nodes)

50

Speedup on aget

51

Speedup on bbuf

52

Conclusions and Future Work

• Method described is VERY promising

• We have an in-situ model checker for MPI programs also! (EuroPVM / MPI 2007)

– Will be parallelized using MPI for work distribution!

• The C/PThread work needs to be pushed a lot more:

– Automate instrumentation

– Try many new examples

– Improve work-distribution heuristic in response to findings

– Release tool

53

Questions?

54

Answers !

Properties: Currently – Local “assert”s

– Deadlocks

– Uninitialized Variables

No plans for liveness

Tool release likely in 6 months

That is a very good question. Let’s talk!

55

Extra Slides

56

Concurrent operations on some database

Class A operations:

pthread_mutex_lock(mutex);
a_count++;
if (a_count == 1) pthread_mutex_lock(res);
pthread_mutex_unlock(mutex);
…
pthread_mutex_lock(mutex);
a_count--;
if (a_count == 0) pthread_mutex_unlock(res);
pthread_mutex_unlock(mutex);

Class B operations:

pthread_mutex_lock(mutex);
b_count++;
if (b_count == 1) pthread_mutex_lock(res);
pthread_mutex_unlock(mutex);
…
pthread_mutex_lock(mutex);
b_count--;
if (b_count == 0) pthread_mutex_unlock(res);
pthread_mutex_unlock(mutex);

57

Initial random execution

a1: acquire mutex
a2: a_count++
a3: a_count == 1
a4: acquire res
a5: release mutex
a6: acquire mutex
a7: a_count--
a8: a_count == 0
a9: release res
a10: release mutex
b1: acquire mutex
b2: b_count++
b3: b_count == 1
b4: acquire res
b5: release mutex
b6: acquire mutex
b7: b_count--
b8: b_count == 0
b9: release res
b10: release mutex

Class A operations:


58



Class A operations:


59



Class A operations:


60



Class A operations:


61



Class A operations:


62



Class A operations:


63


a1: acquire mutex
a2: a_count++
a3: a_count == 1
a4: acquire res
a5: release mutex
a6: acquire mutex
a7: a_count--
a8: a_count == 0
a9: release res
a10: release mutex
b1: acquire mutex
b2: b_count++
b3: b_count == 1
b4: acquire res
b5: release mutex
b6: acquire mutex
b7: b_count--
b8: b_count == 0
b9: release res
b10: release mutex

Class A operations:


64



Class A operations:


65



Class A operations:


66



Class A operations:


67



Class B operations:


68


a1: acquire mutex
a2: a_count++
a3: a_count == 1
a4: acquire res
a5: release mutex
a6: acquire mutex
a7: a_count--
a8: a_count == 0
a9: release res
a10: release mutex
b1: acquire mutex
b2: b_count++
b3: b_count == 1
b4: acquire res
b5: release mutex
b6: acquire mutex
b7: b_count--
b8: b_count == 0
b9: release res
b10: release mutex

Class B operations:


69



Class B operations:


70

Dependent operations?


Class B operations:


71

Start an alternative execution


Class A operations:


72

Get a deadlock!

a1: acquire mutex
a2: a_count++
a3: a_count == 1
a4: acquire res
a5: release mutex
b1: acquire mutex
b2: b_count++
b3: b_count == 1
a6: acquire mutex
a7: a_count--
a8: a_count == 0
a9: release res
a10: release mutex
b4: acquire res
b5: release mutex
b6: acquire mutex
b7: b_count--
b8: b_count == 0
b9: release res
b10: release mutex

Class A operations:

pthread_mutex_lock(mutex);
a_count++;
if (a_count == 1) pthread_mutex_lock(res);
pthread_mutex_unlock(mutex);
pthread_mutex_lock(mutex);

Class B operations:

pthread_mutex_lock(mutex);
b_count++;
if (b_count == 1) pthread_mutex_lock(res);
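The state reached here can be checked with a small model. This is a sketch under stated assumptions: the Class A thread (id 0) has run a1 to a5 and holds res, the Class B thread (id 1) has run b1 to b3 and holds mutex, and each thread's next step is a lock acquisition. The types and the `is_deadlocked` helper are hypothetical and only model lock ownership, not real Pthreads calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { NO_OWNER = -1 };

/* Model of a non-recursive mutex: just the id of its current owner. */
typedef struct { int owner; } lock_state_t;

typedef struct {
    bool finished;              /* thread has no more steps */
    const lock_state_t *wants;  /* lock its next step must acquire,
                                   or NULL if the next step cannot block */
} thread_state_t;

/* Deadlock: at least one thread is unfinished, and every unfinished
 * thread is waiting for a lock that is currently held. */
bool is_deadlocked(const thread_state_t *threads, size_t n)
{
    assert(threads != NULL || n == 0);
    bool someone_unfinished = false;
    for (size_t i = 0; i < n; i++) {
        if (threads[i].finished)
            continue;
        someone_unfinished = true;
        if (threads[i].wants == NULL || threads[i].wants->owner == NO_OWNER)
            return false;       /* this thread can still take a step */
    }
    return someone_unfinished;
}
```

In the modeled state, thread A waits for mutex (held by B) and thread B waits for res (held by A), so neither can proceed.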
