A Shape Analysis for Optimizing Parallel Graph Programs

Dimitrios Prountzos (1), Keshav Pingali (1,2), Roman Manevich (2), Kathryn S. McKinley (1)

1: Department of Computer Science, The University of Texas at Austin
2: Institute for Computational Engineering and Sciences, The University of Texas at Austin

Motivation

• Graph algorithms are ubiquitous, e.g. in computational biology, social networks, and computer graphics

• Goal: Compiler analysis for optimization of parallel graph algorithms

Organization

• Parallelization of graph algorithms in Galois system
– Speculative execution
– Example: Boruvka MST algorithm

• Optimization opportunities
– Reduce speculation overheads
– Analysis problem: Lockset shape analysis

• Lockset shape analysis
– Abstract Data Type (ADT) modeling
– Hierarchy summarization abstraction
– Predicate discovery

• Evaluation
– Fast and infers all available optimizations
– Optimizations give speedup up to 12x

Boruvka’s Minimum Spanning Tree Algorithm

Build MST bottom-up
repeat {
  pick arbitrary node ‘a’
  merge with lightest neighbor ‘lt’
  add edge ‘a-lt’ to MST
} until graph is a single node
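The repeat loop above can be sketched sequentially in plain Java. This is a minimal illustration, not Galois code: the class BoruvkaSketch, its adjacency-map representation, and the method names are assumptions made for the example, and the graph is assumed to have no self-loops.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not Galois code): sequential Boruvka-style MST by edge contraction.
final class BoruvkaSketch {
    // Undirected weighted graph: node -> (neighbor -> weight of the lightest edge between them).
    final Map<Integer, Map<Integer, Integer>> adj = new HashMap<>();

    void addEdge(int u, int v, int w) {
        adj.computeIfAbsent(u, k -> new HashMap<>()).merge(v, w, Integer::min);
        adj.computeIfAbsent(v, k -> new HashMap<>()).merge(u, w, Integer::min);
    }

    // Returns the total MST weight; the graph is contracted in place.
    long mstWeight() {
        long total = 0;
        Deque<Integer> wl = new ArrayDeque<>(adj.keySet());
        while (adj.size() > 1 && !wl.isEmpty()) {
            int a = wl.poll();                              // pick arbitrary node a
            Map<Integer, Integer> aNghbrs = adj.get(a);
            if (aNghbrs == null || aNghbrs.isEmpty()) continue; // merged away or isolated

            int lt = -1;
            int minW = Integer.MAX_VALUE;
            for (Map.Entry<Integer, Integer> e : aNghbrs.entrySet()) {
                if (e.getValue() < minW) { minW = e.getValue(); lt = e.getKey(); }
            }
            total += minW;                                  // add edge a-lt to the MST

            // Merge lt into a: redirect lt's edges to a, keeping the lighter parallel edge.
            aNghbrs.remove(lt);
            Map<Integer, Integer> ltNghbrs = adj.remove(lt);
            ltNghbrs.remove(a);
            for (Map.Entry<Integer, Integer> e : ltNghbrs.entrySet()) {
                int n = e.getKey();
                int w = e.getValue();
                adj.get(n).remove(lt);
                adj.get(n).merge(a, w, Integer::min);
                aNghbrs.merge(n, w, Integer::min);
            }
            wl.add(a);                                      // a stays active
        }
        return total;
    }
}
```

Every step touches only a, lt, and their immediate neighbors, which is the neighborhood that the later slides reason about.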

[Figure: example weighted graph on nodes a to g, before and after node a is merged with its lightest neighbor lt (node c) into the contracted node a,c]

Parallelism in Boruvka

• Algorithm = repeated application of operator to graph
– Active node: node where computation is needed
– Activity: application of operator to active node
– Neighborhood: sub-graph read/written to perform activity
– Unordered algorithms: active nodes can be processed in any order

• Amorphous data-parallelism
– Parallel execution of activities, subject to neighborhood constraints

• Neighborhoods are functions of runtime values
– Parallelism cannot be uncovered at compile time in general

[Figure: graph with activities i1, i2, i3]

Optimistic Parallelization in Galois

• Programming model
– Client code has sequential semantics
– Library of concurrent data structures

• Parallel execution model
– Thread-level speculation (TLS)
– Activities executed speculatively

• Conflict detection
– Each node/edge has associated exclusive lock
– Graph operations acquire locks on read/written nodes/edges
– Lock owned by another thread ⇒ conflict ⇒ iteration rolled back
– All locks released at the end

• Two main overheads
– Locking
– Undo actions

[Figure: graph with activities i1, i2, i3]
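The conflict detection protocol on this slide can be sketched as follows. This is a simplified illustration under stated assumptions, not the Galois runtime: the classes Lockable, Iteration, and Conflict are hypothetical names, and an undo action is modeled as a plain Runnable.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of speculative execution with exclusive per-object locks.
final class SpeculationSketch {
    // Signals that a lock is owned by another iteration.
    static final class Conflict extends RuntimeException {}

    // A node or edge with an associated exclusive lock.
    static final class Lockable {
        final AtomicReference<Iteration> owner = new AtomicReference<>();
    }

    static final class Iteration {
        private final List<Lockable> held = new ArrayList<>();
        private final Deque<Runnable> undoLog = new ArrayDeque<>();

        // Overhead 1: every graph operation locks the objects it reads or writes.
        void acquire(Lockable o) {
            if (o.owner.get() == this) return;              // already ours
            if (!o.owner.compareAndSet(null, this)) throw new Conflict();
            held.add(o);
        }

        // Overhead 2: every mutation records how to undo itself.
        void logUndo(Runnable inverse) { undoLog.push(inverse); }

        void commit() { undoLog.clear(); releaseAll(); }

        // Undo actions run in reverse order, then all locks are released.
        void rollback() {
            while (!undoLog.isEmpty()) undoLog.pop().run();
            releaseAll();
        }

        private void releaseAll() {
            for (Lockable o : held) o.owner.set(null);
            held.clear();
        }
    }
}
```

Both overheads are visible here: acquire runs on every access even when the lock is already held, and logUndo runs even when no later conflict is possible.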

Overheads (I): Locking

• Optimizations
– Redundant locking elimination
– Lock removal for iteration private data
– Lock removal for lock domination

• ACQ(P): set of definitely acquired locks per program point P
• Given method call M at P: Locks(M) ⊆ ACQ(P) ⇒ Redundant Locking

Overheads (II): Undo actions

GSet<Node> wl = new GSet<Node>();
wl.addAll(g.getNodes());
GBag<Weight> mst = new GBag<Weight>();

foreach (Node a : wl) {
  // Find lightest neighbor
  Set<Node> aNghbrs = g.neighbors(a);
  Node lt = null;
  for (Node n : aNghbrs) {
    minW,lt = minWeightEdge((a,lt), (a,n));
  }
  // Update neighbors of lightest
  g.removeEdge(a, lt);
  Set<Node> ltNghbrs = g.neighbors(lt);
  for (Node n : ltNghbrs) {
    Edge e = g.getEdge(lt, n);
    Weight w = g.getEdgeData(e);
    Edge an = g.getEdge(a, n);
    if (an != null) {
      Weight wan = g.getEdgeData(an);
      if (wan.compareTo(w) < 0)
        w = wan;
      g.setEdgeData(an, w);
    } else {
      g.addEdge(a, n, w);
    }
  }
  g.removeNode(lt);
  // Update worklist and MST
  mst.add(minW);
  wl.add(a);
}

[Figure annotations on the code: Lockset Grows, Lockset Stable, Failsafe]

Program point P is failsafe if: ∀Q : Reaches(P,Q) ⇒ Locks(Q) ⊆ ACQ(P)

Lockset Analysis

• Redundant Locking
– Locks(M) ⊆ ACQ(P)

• Undo elimination
– ∀Q : Reaches(P,Q) ⇒ Locks(Q) ⊆ ACQ(P)

• Need to compute ACQ(P)

[Figure: the Boruvka code with runtime overhead highlighted]
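Both conditions can be checked mechanically once ACQ(P) is known. The sketch below does so for the simplest possible setting, a straight-line sequence of method calls with locks modeled as strings; the class LocksetSketch and its methods are illustrative assumptions, and the actual analysis handles loops, branches, and heap-allocated lock objects.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: lockset conditions over a straight-line sequence of calls.
final class LocksetSketch {
    // acq.get(i) = ACQ(P_i): locks definitely held just before call i.
    static List<Set<String>> acquiredBefore(List<Set<String>> locks) {
        List<Set<String>> acq = new ArrayList<>();
        Set<String> soFar = new HashSet<>();
        for (Set<String> l : locks) {
            acq.add(new HashSet<>(soFar));
            soFar.addAll(l);
        }
        return acq;
    }

    // Call i locks redundantly if Locks(M_i) is a subset of ACQ(P_i).
    static boolean[] redundant(List<Set<String>> locks) {
        List<Set<String>> acq = acquiredBefore(locks);
        boolean[] out = new boolean[locks.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = acq.get(i).containsAll(locks.get(i));
        }
        return out;
    }

    // P_i is failsafe if every call reachable from P_i only needs locks in ACQ(P_i).
    // One backward pass accumulates the locks needed from call i onwards.
    static boolean[] failsafe(List<Set<String>> locks) {
        List<Set<String>> acq = acquiredBefore(locks);
        boolean[] out = new boolean[locks.size()];
        Set<String> needed = new HashSet<>();
        for (int i = locks.size() - 1; i >= 0; i--) {
            needed.addAll(locks.get(i));
            out[i] = acq.get(i).containsAll(needed);
        }
        return out;
    }
}
```

In this setting the failsafe points form a suffix of the call sequence: once the lockset stops growing, no later call can conflict, so undo logging is unnecessary from that point on.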

Analysis Challenges

• The usual suspects:
– Unbounded Memory ⇒ Undecidability
– Aliasing, Destructive updates

• Specific challenges:
– Complex ADTs: unstructured graphs
– Heap objects are locked
– Adapt abstraction to ADTs

• We use Abstract Interpretation [CC’77]
– Balance precision and realistic performance

Shape Analysis Overview

[Figure: analysis pipeline. ADT specifications (Graph Spec: Graph { @rep nodes @rep edges … }, Set Spec: Set { @rep cont … }) stand for the concrete ADT implementations in the Galois library (HashMap-Graph, Tree-based Set, …). Predicate discovery and shape analysis turn Boruvka.java into OptimizedBoruvka.java.]

ADT Specification

Graph Spec:

Graph<ND,ED> {
  @rep set<Node> nodes
  @rep set<Edge> edges

  @locks(n + n.rev(src) + n.rev(src).dst + n.rev(dst) + n.rev(dst).src)
  @op( nghbrs = n.rev(src).dst + n.rev(dst).src ,
       ret = new Set<Node<ND>>(cont=nghbrs) )
  Set<Node> neighbors(Node n);
}

Boruvka.java:

...
Set<Node> S1 = g.neighbors(n);
...

• Abstract ADT state by virtual set fields
• Assumption: Implementation satisfies Spec
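To make the path expressions in the specification concrete, the sketch below evaluates them on an explicit edge list. It reads n.rev(src) as the edges whose src field is n, and n.rev(src).dst as the dst endpoints of those edges. The classes NeighborsSpecSketch and Edge are hypothetical and only illustrate the semantics; they are not part of the specification language or of Galois.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: evaluating the @op and @locks expressions of neighbors(n).
final class NeighborsSpecSketch {
    static final class Edge {
        final String src, dst;
        Edge(String src, String dst) { this.src = src; this.dst = dst; }
    }

    // nghbrs = n.rev(src).dst + n.rev(dst).src
    static Set<String> neighbors(String n, List<Edge> edges) {
        Set<String> out = new TreeSet<>();
        for (Edge e : edges) {
            if (e.src.equals(n)) out.add(e.dst);   // n.rev(src).dst
            if (e.dst.equals(n)) out.add(e.src);   // n.rev(dst).src
        }
        return out;
    }

    // locks = n + n.rev(src) + n.rev(src).dst + n.rev(dst) + n.rev(dst).src
    static Set<Object> locks(String n, List<Edge> edges) {
        Set<Object> out = new LinkedHashSet<>();
        out.add(n);
        for (Edge e : edges) {
            if (e.src.equals(n)) { out.add(e); out.add(e.dst); }
            if (e.dst.equals(n)) { out.add(e); out.add(e.src); }
        }
        return out;
    }
}
```

For a node with k incident edges and k distinct neighbors, the lock set contains 2k + 1 objects: the node itself, its incident edges, and its immediate neighbors.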

Modeling ADTs

[Figure: concrete graph with nodes a, b, c whose edge objects point to their endpoints through src and dst fields, shown next to the Graph Spec. The abstract state has virtual set fields nodes and edges; a call to neighbors yields nghbrs, and ret is a new set whose cont field holds nghbrs.]






Abstraction Scheme

[Figure: two sets S1 and S2, each with a cont field; the objects reached through S1.cont and S2.cont are locked]

(S1 ≠ S2) ∧ L(S1.cont) ∧ L(S2.cont)

• Parameterized by set of LockPaths: L(Path) ≜ ∀o . o ∊ Path ⇒ Locked(o)
– Tracks subset of must-be-locked objects

• Abstract domain elements have the form: Aliasing-configs → 2^LockPaths …

Joining Abstract States

( L(y.nd) ) ( () L(x.nd) )

( L(y.nd) ) ( L(y.nd) L(x.rev(src)) ) ( () L(x.nd) )

( L(y.nd) ) ( () L(x.nd) )

• Aliasing is crucial for precision
• May-be-locked does not enable our optimizations
• #Aliasing-configs: small constant (6)
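One way to picture a join for must-be-locked information is a pointwise intersection: for each aliasing configuration, keep only the lock paths that hold in both incoming states. The sketch below assumes that representation. The class JoinSketch, the string encoding of aliasing configurations such as "x!=y", and the choice to keep configurations that occur on only one side are assumptions made for illustration, not the definition used by the analysis.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: abstract state = aliasing configuration -> definitely locked paths.
final class JoinSketch {
    static Map<String, Set<String>> join(Map<String, Set<String>> s1,
                                         Map<String, Set<String>> s2) {
        Map<String, Set<String>> out = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : s1.entrySet()) {
            out.put(e.getKey(), new HashSet<>(e.getValue()));
        }
        for (Map.Entry<String, Set<String>> e : s2.entrySet()) {
            Set<String> current = out.get(e.getKey());
            if (current == null) {
                // Configuration seen on one side only: keep its lock paths unchanged.
                out.put(e.getKey(), new HashSet<>(e.getValue()));
            } else {
                // Same configuration on both sides: a path survives only if both agree.
                current.retainAll(e.getValue());
            }
        }
        return out;
    }
}
```

Keeping states apart by aliasing configuration is what preserves precision here: merging the x = y and x ≠ y cases into one set would intersect away lock paths that hold in only one of them.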

Example Invariant in Boruvka

The immediate neighbors of a and lt are locked

[Figure: graph fragment around nodes a and lt]

( a ≠ lt ) ∧ L(a) ∧ L(a.rev(src)) ∧ L(a.rev(dst)) ∧ L(a.rev(src).dst) ∧ L(a.rev(dst).src) ∧
L(lt) ∧ L(lt.rev(dst)) ∧ L(lt.rev(src)) ∧ L(lt.rev(dst).src) ∧ L(lt.rev(src).dst) ∧ …

Heuristics for Finding Paths

• Hierarchy Summarization (HS)
– x.( fld )*
– Type hierarchy graph acyclic ⇒ bounded number of paths
– Preflow-Push:
  • L(S.cont) ∧ L(S.cont.nd)
  • Nodes in set S and their data are locked

[Figure: type hierarchy. Variable S has type Set<Node>; field cont leads from Set<Node> to Node; field nd leads from Node to NodeData]
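Because the type hierarchy graph is acyclic, the paths of the form x.( fld )* can be enumerated by a simple recursive walk that is guaranteed to terminate. The sketch below illustrates this on the Set<Node>, Node, NodeData hierarchy from the figure; the class HierarchyPathsSketch and its map-based encoding of the type graph are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: enumerating x.(fld)* paths over an acyclic type hierarchy graph.
final class HierarchyPathsSketch {
    // typeGraph: type name -> (field name -> type of that field). Must be acyclic.
    static List<String> paths(String var, String type,
                              Map<String, Map<String, String>> typeGraph) {
        List<String> out = new ArrayList<>();
        collect(var, type, typeGraph, out);
        return out;
    }

    private static void collect(String prefix, String type,
                                Map<String, Map<String, String>> typeGraph,
                                List<String> out) {
        out.add(prefix);
        Map<String, String> fields =
            typeGraph.getOrDefault(type, Collections.<String, String>emptyMap());
        for (Map.Entry<String, String> f : fields.entrySet()) {
            // Acyclicity bounds the recursion depth by the number of types.
            collect(prefix + "." + f.getKey(), f.getValue(), typeGraph, out);
        }
    }
}
```

Each enumerated path is a candidate lock path; for a variable S of type Set<Node>, the walk yields S, S.cont, and S.cont.nd, matching the Preflow-Push predicates on the slide.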

Footprint Graph Heuristic

• Footprint Graphs (FG) [Calcagno et al. SAS’07]
– All acyclic paths from arguments of ADT method to locked objects
– x.( fld | rev(fld) )*
– Delaunay Refinement: L(S.cont) ∧ L(S.cont.rev(src)) ∧ L(S.cont.rev(dst)) ∧ L(S.cont.rev(src).dst) ∧ L(S.cont.rev(dst).src)
– Nodes in set S and all of their immediate neighbors are locked

• Composition of HS, FG
– Preflow-Push: L(a.rev(src).ed) (FG part: a.rev(src); HS part: ed)

• Shape analysis
– Abstract Data Type modeling
– Hierarchy summarization abstraction
– Predicate discovery

Experimental Evaluation

• Implement on top of TVLA
– Encode abstraction by 3-Valued Shape Analysis [SRW TOPLAS’02]
• Evaluation on 4 Lonestar Java benchmarks
• Inferred all available optimizations
• # abstract states practically linear in program size

Benchmark                   Analysis Time (sec)
Boruvka MST                 6
Preflow-Push Maxflow        7
Survey Propagation          12
Delaunay Mesh Refinement    16

Impact of Optimizations for 8 Threads

[Figure: bar chart of running time (sec), Baseline vs. Optimized, for Boruvka MST, Delaunay Mesh Refinement, Survey Propagation, and Preflow-Push Maxflow. Speedup labels on the chart: 5.6×, 4.7×, 11.4×, 2.9×. Machine: 8-core Intel Xeon @ 3.00 GHz]

Related Work

• Safe programmable speculative parallelism [Prabhu et al. PLDI’10]
– Focused on value speculation on ordered algorithms
– Different rollback freedom condition

• Transactional Memory compiler optimizations [Harris et al. PLDI’06, Dragojevic et al. SPAA’09]
– Similar optimizations
– Don’t target rollback freedom
– Imprecise for unbounded data-structures

• Optimizations for parallel graph programs [Mendez-Lojo et al. PPOPP’10]
– Manual optimizations
– Failsafe subsumes cautious

• Verifying conformance of ADT implementation to specification
– The Jahob project (Kuncak, Rinard, Wies et al.)

Conclusion

• New application for static analysis
– Optimization of optimistically parallelized graph programs

• Novel shape analysis
– Utilize observations on the structure of concrete states and programming style

• Enables optimizations crucial for performance


Thank You!


Backup

Outline of Boruvka MST Code

GSet<Node> wl = new GSet<Node>();
wl.addAll(g.getNodes());
GBag<Weight> mst = new GBag<Weight>();

• Pick arbitrary worklist node
• Find lightest neighbor
• Update neighbors of lightest
• Update worklist and MST

Approximating Sets of Locked Objects

Reachability-based Scheme

[Figure: sets S1 and S2 with cont fields, abstracted by the predicates Reach(S1.cont) and Reach(S2.cont)]

Approximating Sets of Locked Objects

Hierarchy Summarization Scheme

[Figure: sets S1 and S2 with cont fields, abstracted by the lock paths L(S1.cont) and L(S2.cont)]

(S1 ≠ S2) ∧ L(S1.cont) ∧ L(S2.cont)

L(Path) ≜ ∀o . o ∊ Path ⇒ Locked(o)

Enabling Optimizations in Galois

• Graph API uses flags to enable/disable locking and storing undo actions
– removeEdge(Node src, Node dst, Flag f);

• Challenge: Find minimal flag per ADT method call
• Solution: Lockset analysis

[Figure: flags NONE, UNDO, LOCKS, ALL; addEdge(src, dst) acquires Lock(src), Lock(dst), and Lock( (src,dst) )]
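Given the two facts computed by the lockset analysis, picking the minimal flag for a call is a small case analysis. The sketch below assumes the flags on the slide are four values, NONE, LOCKS, UNDO, and ALL, where LOCKS enables only conflict detection and UNDO enables only undo logging. That reading, the enum Flag, and the method minimalFlag are assumptions for illustration and not the Galois API.

```java
// Hypothetical sketch: choosing the cheapest flag for one ADT method call.
final class FlagSketch {
    enum Flag { NONE, LOCKS, UNDO, ALL }

    // locksRedundant:     Locks(M) is a subset of ACQ(P) at the call site.
    // afterFailsafePoint: the call site is failsafe, so no rollback can follow.
    static Flag minimalFlag(boolean locksRedundant, boolean afterFailsafePoint) {
        boolean needLocks = !locksRedundant;
        boolean needUndo = !afterFailsafePoint;
        if (needLocks && needUndo) return Flag.ALL;
        if (needLocks) return Flag.LOCKS;
        if (needUndo) return Flag.UNDO;
        return Flag.NONE;
    }
}
```

A call site that is both redundant and failsafe gets NONE, so the method runs with no speculation overhead at all.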

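Optimization Conditions

• ACQ(P): set of definitely acquired locks per program point P

• Given method call M at P: Locks(M) ⊆ ACQ(P) ⇒ Redundant Locking

• Program point P is failsafe if: ∀Q : Reaches(P,Q) ⇒ Locks(Q) ⊆ ACQ(P)
– Infer failsafe points from ACQ by simple backward analysis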

Speculation Overheads and Optimizations

Source of Overhead                    Optimization
Locking shared objects                • Redundant locking elimination
                                      • Lock elision for iteration private data
                                      • Lock domination
Backup original state for rollback    • Avoid backups after failsafe points

Modeling ADTs

Graph Spec:

Graph<ND,ED> {
  @rep set<Node> ns; // nodes
  @rep set<Edge> es; // edges

  @locks(n + n.rev(src) + n.rev(src).dst + n.rev(dst) + n.rev(dst).src)
  @op( nghbrs = n.rev(src).dst + n.rev(dst).src ,
       ret = new Set<Node<ND>>(cont=nghbrs) )
  Set<Node> neighbors(Node n);
}

[Figure: weighted graph on nodes a, b, c, d; the edge between a and b with weight 5 is shown as an edge object with src, dst, and ed fields. The abstract state has virtual set fields ns and es, plus nghbrs and ret with its cont field.]

Failsafe Points – Eliminating Undo Actions

[Figure: weighted graph on nodes a, b, c, d with a and lt marked, at the call g.neighbors(lt); legend: Graph Node, Graph Edge, Edge Data]

CAUTIOUS OPERATOR [Mendez-Lojo et al. PPOPP’10]


• Build MST bottom-up
repeat {
  pick arbitrary active node ‘a’
  merge with lightest neighbor ‘lt’
  add edge ‘a-lt’ to MST
} until graph is a single node

[Figure: example weighted graph on nodes a to g]

Failsafe Points – Eliminating Undo Actions

[Figure: weighted graph on nodes a, b, c, d with a and lt marked, at the call g.neighbors(lt); legend: Graph Node, Graph Edge, Edge Data; lock status: Acquired, New Lock, Redundant]

CAUTIOUS OPERATOR [Mendez-Lojo et al. PPOPP’10]



Redundant Locking Example

[Figure: weighted graph on nodes a, b, c, d with a and lt marked; legend: Graph Node, Graph Edge, Edge Data; lock status: Acquired, New Lock, Redundant]


[Figure: example weighted graph on nodes a to g]

• Build MST iteratively
• Pick random active node
• Contract edge with lightest neighbor



Parallelism in Boruvka’s Algorithm

[Figure: example weighted graph on nodes a to g]

• Dependences between activities are functions of runtime values
• Parallelism cannot be uncovered at compile time in general

• Don’t Care Non-Determinism
• All produced MSTs correct and optimal

Abstraction of a Single State

[Figure: weighted graph on nodes a, b, c, d with a and lt marked]

State after first loop in Boruvka

( a ≠ lt ) ∧ L(a) ∧ L(a.rev(src)) ∧ L(a.rev(dst)) ∧ L(a.rev(src).dst) ∧ L(a.rev(dst).src) ∧ UniqueRef(EdgeData)

We maintain:
• Set of definitely locked objects denoted by lockpaths
• Aliasing of top level variables
• Uniquely pointed-to types and objects referenced from the stack

We lose:
• Maybe locked information
• Cardinality of sets
• Content sharing of multiple collections

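Hierarchy Summarization Intuition

[Figure: type hierarchy graph for Boruvka. Types: Graph, Set<Node>, GSet<Node>, GBag<Weight>, Iterator<Node>, Node, Edge, Weight, Void. Fields: ns, es, cont, gcont, bcont, src, dst, ed, nd, all, past, at, future. Program variables: g; aNghbrs, ltNghbrs; nIter; a, lt, n; e, an; w, wan, minW; wl; mst]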
