Upload
jodie-strickland
View
217
Download
1
Embed Size (px)
Citation preview
A Shape Analysis for Optimizing Parallel Graph Programs
Dimitrios Prountzos1
Keshav Pingali1,2
Roman Manevich2
Kathryn S. McKinley1
1: Department of Computer Science, The University of Texas at Austin2: Institute for Computational Engineering and Sciences, The
University of Texas at Austin
2
Motivation
• Graph algorithms are ubiquitous
• Goal: Compiler analysis for optimization of parallel graph algorithms
Computational biology Social Networks
Computer Graphics
3
Organization• Parallelization of graph algorithms in Galois system
– Speculative execution– Example: Boruvka MST algorithm
• Optimization opportunities– Reduce speculation overheads– Analysis problem: LockSet shape analysis
• Lockset shape analysis – Abstract Data Type (ADT) modeling– Hierarchy summarization abstraction– Predicate discovery
• Evaluation– Fast and infers all available optimizations– Optimizations give speedup up to 12x
4
Boruvka’s Minimum Spanning Tree Algorithm
Build MST bottom-uprepeat { pick arbitrary node ‘a’ merge with lightest neighbor ‘lt’ add edge ‘a-lt’ to MST} until graph is a single node
c d
a b
e f
g
2 4
6
5
3
7
4
1d
a,c b
e f
g
4
6
3
4
17
lt
5
• Algorithm = repeated application of operator to graph
– Active node: • Node where computation is needed
– Activity: • Application of operator to active node
– Neighborhood:• Sub-graph read/written to perform activity
– Unordered algorithms: • Active nodes can be processed in any order
• Amorphous data-parallelism– Parallel execution of activities, subject to neighborhood constraints
• Neighborhoods are functions of runtime values – Parallelism cannot be uncovered at compile time in general
Parallelism in Boruvka
i1
i2
i3
6
Optimistic Parallelization in Galois• Programming model
– Client code has sequential semantics– Library of concurrent data structures
• Parallel execution model– Thread-level speculation (TLS)– Activities executed speculatively
• Conflict detection– Each node/edge has associated exclusive
lock– Graph operations acquire locks on
read/written nodes/edges– Lock owned by another thread conflict
iteration rolled back– All locks released at the end
• Two main overheads– Locking– Undo actions
i1
i2
i3
7
Overheads (I): Locking• Optimizations– Redundant locking elimination– Lock removal for iteration private data– Lock removal for lock domination
• ACQ(P): set of definitely acquired locks per program point P• Given method call M at P:
Locks(M) ACQ(P) Redundant Locking
8
Overheads (II): Undo actions
Lockset Grows
Lockset Stable
Failsafe
…
foreach (Node a : wl) {
…
…
}
foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a);}
Program point P is failsafe if: Q : Reaches(P,Q) Locks(Q) ACQ(P)
9
GSet<Node> wl = new GSet<Node>();wl.addAll(g.getNodes());GBag<Weight> mst = new GBag<Weight>();
foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a);}
Lockset Analysis
• Redundant Locking• Locks(M) ACQ(P)
• Undo elimination• Q : Reaches(P,Q)
Locks(Q) ACQ(P)
• Need to compute ACQ(P)
: Runtime overhead
10
Analysis Challenges• The usual suspects: – Unbounded Memory Undecidability – Aliasing, Destructive updates
• Specific challenges:– Complex ADTs: unstructured graphs– Heap objects are locked– Adapt abstraction to ADTs
• We use Abstract Interpretation [CC’77]– Balance precision and realistic performance
11
Organization• Parallelization of graph algorithms in Galois system
– Speculative execution– Example: Boruvka MST algorithm
• Optimization opportunities– Reduce speculation overheads– Analysis problem: LockSet shape analysis
• Lockset shape analysis – Abstract Data Type (ADT) modeling– Hierarchy summarization abstraction– Predicate discovery
• Evaluation– Fast and infers all available optimizations– Optimizations give speedup up to 12x
Shape Analysis Overview
12
HashMap-Graph
Tree-based Set
……
Graph { @rep nodes @rep edges …}
Graph Spec
Concrete ADTImplementationsin Galois library
Predicate Discovery
Shape Analysis
Boruvka.javaOptimizedBoruvka.java
Set { @rep cont …}
Set Spec
ADT Specifications
13
ADT Specification
Graph<ND,ED> {
@rep set<Node> nodes @rep set<Edge> edges
Set<Node> neighbors(Node n);
}
Graph Spec
...Set<Node> S1 = g.neighbors(n);
...
Boruvka.java
Abstract ADT state by virtual set fields
@locks(n + n.rev(src) + n.rev(src).dst + n.rev(dst) + n.rev(dst).src)@op( nghbrs = n.rev(src).dst + n.rev(dst).src , ret = new Set<Node<ND>>(cont=nghbrs) )
Assumption: Implementation satisfies Spec
14
Graph<ND,ED> {
@rep set<Node> nodes@rep set<Edge> edges
@locks(n + n.rev(src) + n.rev(src).dst + n.rev(dst) + n.rev(dst).src)@op( nghbrs = n.rev(src).dst + n.rev(dst).src , ret = new Set<Node<ND>>(cont=nghbrs) ) Set<Node> neighbors(Node n);}
Modeling ADTs
c
a bGraph Spec
dst
src
srcdst
dst
src
15
Modeling ADTs
c
a b
nodes edges
Abstract State
cont
ret nghbrs
Graph Spec
dst
src
srcdst
dst
src
Graph<ND,ED> {
@rep set<Node> nodes@rep set<Edge> edges
@locks(n + n.rev(src) + n.rev(src).dst + n.rev(dst) + n.rev(dst).src)@op( nghbrs = n.rev(src).dst + n.rev(dst).src , ret = new Set<Node<ND>>(cont=nghbrs) ) Set<Node> neighbors(Node n);}
16
Organization• Parallelization of graph algorithms in Galois system
– Speculative execution– Example: Boruvka MST algorithm
• Optimization opportunities– Reduce speculation overheads– Analysis problem: LockSet shape analysis
• Lockset shape analysis – Abstract Data Type (ADT) modeling– Hierarchy summarization abstraction– Predicate discovery
• Evaluation– Fast and infers all available optimizations– Optimizations give speedup up to 12x
17
cont cont
S1 S2L(S1.cont) L(S2.cont)
Abstraction Scheme
(S1 ≠ S2) ∧ L(S1.cont) ∧ L(S2.cont)
• Parameterized by set of LockPaths: L(Path) o . o ∊ Path Locked(o)– Tracks subset of must-be-locked objects
• Abstract domain elements have the form: Aliasing-configs 2LockPaths …
18
( L(y.nd) ) ( () L(x.nd) )
( L(y.nd) ) ( L(y.nd) L(x.rev(src)) ) ( () L(x.nd) )
Joining Abstract States
( L(y.nd) ) ( () L(x.nd) )
Aliasing is crucial for precisionMay-be-locked does not enable our optimizations
#Aliasing-configs : small constant (6)
19
lt
GSet<Node> wl = new GSet<Node>();wl.addAll(g.getNodes());GBag<Weight> mst = new GBag<Weight>();
foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a);}
Example Invariant in Boruvka
The immediate neighbors of a and lt are locked
a
( a ≠ lt ) ∧ L(a) L(a.rev(src)) L(a.rev(dst))∧ ∧ ∧ L(a.rev(src).dst) L(a.rev(dst).src) ∧ ∧ L(lt) L(lt.rev(dst)) L(lt.rev(src)) ∧ ∧ ∧ L(lt.rev(dst).src) L(lt.rev(src).dst)∧
…..
20
Heuristics for Finding Paths
• Hierarchy Summarization (HS)– x.( fld )*– Type hierarchy graph acyclic
bounded number of paths
– Preflow-Push: • L(S.cont) L(S.cont.nd)∧• Nodes in set S and their data are locked
Set<Node>
S
Node
NodeData
cont
nd
21
Footprint Graph Heuristic• Footprint Graphs (FG)[Calcagno et al. SAS’07]– All acyclic paths from arguments of ADT method to
locked objects– x.( fld | rev(fld) )* – Delaunay Refinement: L(S.cont) L(S.cont.rev(src)) L(S.cont.rev(dst)) ∧ ∧ ∧ L(S.cont.rev(src).dst) L(S.cont.rev(dst).src)∧– Nodes in set S and all of their immediate
neighbors are locked
• Composition of HS, FG– Preflow-Push: L(a.rev(src).ed)
FG HS
22
Organization• Parallelization of graph algorithms in Galois system
– Speculative execution– Example: Boruvka MST algorithm
• Optimization opportunities– Reduce speculation overheads– Analysis problem: LockSet shape analysis
• Shape analysis – Abstract Data Type modeling– Hierarchy summarization abstraction– Predicate discovery
• Evaluation– Fast and infers all available optimizations– Optimizations give speedup up to 12x
23
Experimental Evaluation• Implement on top of TVLA– Encode abstraction by 3-Valued Shape Analysis
[SRW TOPLAS’02]• Evaluation on 4 Lonestar Java benchmarks
• Inferred all available optimizations• # abstract states practically linear in program size
Benchmark Analysis Time (sec)
Boruvka MST 6
Preflow-Push Maxflow 7
Survey Propagation 12
Delaunay Mesh Refinement 16
24
Impact of Optimizations for 8 Threads
Boruvka MST Delaunay Mesh Refinement
Survey Propaga-tion
Preflow-Push Maxflow
0
50
100
150
200
250
Baseline Optimized
Tim
e (s
ec)
5.6× 4.7×11.4×
2.9× 8-core Intel Xeon @ 3.00 GHz
25
Related Work• Safe programmable speculative parallelism [Prabhu et al. PLDI’10]
– Focused on value speculation on ordered algorithms– Different rollback freedom condition
• Transactional Memory compiler optimizations [Harris et al. PLDI’06, Dragojevic et al. SPAA’09]– Similar optimizations– Don’t target rollback freedom– Imprecise for unbounded data-structures
• Optimizations for parallel graph programs [Mendez-Lojo et al. PPOPP’10]– Manual optimizations– Failsafe subsumes cautious
• Verifying conformance of ADT implementation to specification– The Jahob project (Kuncak, Rinard, Wies et al.)
26
Conclusion• New application for static analysis – Optimization of optimistically parallelized graph
programs
• Novel shape analysis– Utilize observations on the structure of concrete
states and programming style
• Enables optimizations crucial for performance
27
Thank You!
28
Backup
Outline of Boruvka MST CodeGSet<Node> wl = new GSet<Node>();wl.addAll(g.getNodes());GBag<Weight> mst = new GBag<Weight>();
foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a);}
Pick arbitrary worklist node
Find lightest neighbor
Update neighbors of lightest
Update worklist and MST
30
Approximating Sets of Locked Objects
Reachability-based Scheme
cont cont
S1 S2
Reach(S1.cont)Reach(S2.cont) Reach(S2.cont) Reach(S2.cont)
cont cont
S1 S2
Reach(S1.cont)Reach(S2.cont) Reach(S2.cont)
31
cont cont
S1 S2L(S1.cont) L(S2.cont)
Approximating Sets of Locked Objects
Hierarchy Summarization Scheme
cont cont
S1 S2L(S1.cont) L(S2.cont) (S1 ≠ S2)
∧ L(S1.cont) ∧ L(S2.cont)
L(Path) o . o ∊ Path Locked(o)
32
• Graph API uses flags to enable/disable locking and storing undo actions
– removeEdge(Node src, Node dst, Flag f);
Enabling Optimizations in Galois
Challenge: Find minimal flag per ADT method call Solution: Lockset analysis
UNDOLOCKS
NONE
ALL
Lock(src)Lock(dst)
Lock( (src,dst) )addEdge(src, dst);
33
Optimization Conditions• set of definitely acquired locks per program
point
• Given method call at :
• Program point is failsafe if:
– Infer from by simple backward analysis
34
Speculation Overheads and Optimizations
Source of Overhead Optimization
Locking shared objects• Redundant locking elimination• Lock elision for iteration private data• Lock domination
Backup original state for rollback • Avoid backups after failsafe points
35
Modeling ADTs
c d
a b
7
23
4
5
ns es
Abstract State
cont
retnghbrs
Graph<ND,ED> {
@rep set<Node> ns; // nodes @rep set<Edge> es; // edges
@locks(n + n.rev(src) + n.rev(src).dst + n.rev(dst) + n.rev(dst).src)@op( nghbrs = n.rev(src).dst + n.rev(dst).src , ret = new Set<Node<ND>>(cont=nghbrs) ) Set<Node> neighbors(Node n);
}
Graph Spec
a b
5
srcdst
ed
36
Failsafe Points – Eliminating Undo Actions
c d
a b
7
23
4
5
Graph Node
Graph Edge
Edge Data
lt
GSet<Node> wl = new GSet<Node>();wl.addAll(g.getNodes());GBag<Weight> mst = new GBag<Weight>();
foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a);}
a
g.neighbors(lt);CAUTIOUS OPERATOR [Mendez et al. PPOPP’10]
37
Boruvka’s Minimum Spanning Tree Algorithm
• Build MST bottom uprepeat { pick arbitrary active node ‘a’ merge with lightest neighbor ‘lt’ add edge ‘a-lt’ to MST} until graph is a singular node
c d
a b
e f
g
2 4
6
5
3
7
4
1
3
7
39
Failsafe Points – Eliminating Undo Actions
c d
a b
7
23
4
5
Graph Node
Graph Edge
Edge Data Acquired
New Lock
Redundant
lt
GSet<Node> wl = new GSet<Node>();wl.addAll(g.getNodes());GBag<Weight> mst = new GBag<Weight>();
foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a);}
a
g.neighbors(lt);CAUTIOUS OPERATOR [Mendez et al. PPOPP’10]
40
GSet<Node> wl = new GSet<Node>();wl.addAll(g.getNodes());GBag<Weight> mst = new GBag<Weight>();
foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a);}
Redundant Locking Example
c d
a b
7
23
4
5
Graph Node
Graph Edge
Edge Data Acquired
New Lock
Redundant
lt
a
41
Boruvka’s Minimum Spanning Tree Algorithm
c d
a b
e f
g
2 4
6
5
3
7
4
1
3
7
GSet<Node> wl = new GSet<Node>();wl.addAll(g.getNodes());GBag<Weight> mst = new GBag<Weight>();
foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a);}
• Build MST iteratively• Pick random active node• Contract edge with lightest
neighbor
42
GSet<Node> wl = new GSet<Node>();wl.addAll(g.getNodes());GBag<Weight> mst = new GBag<Weight>();
foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a);}
Parallelism in Boruvka’s Algorithm
c d
a b
e f
g
2 4
6
5
37
4
1 f
• Dependences between activities are functions of runtime values • Parallelism cannot be uncovered
at compile time in general
• Don’t Care Non-Determinism• All produced MSTs correct and
optimal
43
Abstraction of a Single State
c d
a b
7
23
4
5
a
lt
State after first loop in Boruvka
( a lt ) ∧ L(a) ∧ L(a.rev(src)) ∧ L(a.rev(dst)) ∧ L(a.rev(src).dst) ∧ L(a.rev(dst).src) ∧
UniqueRef(EdgeData)
We maintain We loose
Set of definitely locked objects denoted by lockpaths Maybe locked information
Aliasing of top level variables Cardinality of sets
Uniquely pointed-to types and objects referenced from the stack
Content sharing of multiple collections
44
Hierarchy Summarization Intuition
GraphSet<Node>
Weight
Node
Edge
Iterator<Node>
es
src,dst
nscont
ed
past, at,future
nd
all
gaNghbrsltNghbrsnIter
a, lt, n
w,wan,
minW
e,an
Gset<Node>
gcont
wl
GBag<Weight>
mst
bcont
Void