Understanding Priority-Based Scheduling of Graph
Algorithms on a Shared-Memory Platform
Serif Yesil∗, Azin Heidarshenas∗, Adam Morrison+, and Josep Torrellas∗
∗University of Illinois at Urbana-Champaign and +Tel Aviv University
November 20, 2019
Priority-Based Scheduling of Graph Algorithms November 20, 2019 1 / 28
Task-Based Graph Processing
Many graph processing frameworks use a task-based model
Computation is broken down into dynamically created tasks
Easy to program
Unordered Task Execution
Tasks can be processed in any order
Correct result regardless of execution order
Easy to parallelize
Problem: may cause unnecessary work
1 The vertex with distance 4 is scheduled
2 A neighbor's distance drops from ∞ to 7
3 The vertex with distance 1 is scheduled
4 The same neighbor's distance drops from 7 to 2
The update in step 2 was unnecessary
[Figure: SSSP example graph with vertex distances 0, 1, 4, and ∞, next to the scheduler; the ∞ vertex is first updated to 7 and later to 2.]
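The wasted update above can be reproduced with a minimal sequential sketch (not the Galois implementation; the graph and names are illustrative): an unordered FIFO scheduler relaxes the far vertex twice.

```python
from collections import deque

def sssp_unordered(graph, source):
    """Chaotic-relaxation SSSP with a plain FIFO as the 'unordered'
    scheduler; any order gives the correct result, but a vertex's
    distance may be written several times (useless work)."""
    dist = {v: float("inf") for v in graph}
    dist[source] = 0
    work = deque([source])
    updates = 0                       # every distance write, useful or not
    while work:
        u = work.popleft()
        for v, w in graph[u]:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                updates += 1          # may be overwritten later
                work.append(v)
    return dist, updates

# The slide's shape: source 0, a weight-4 and a weight-1 branch both
# reaching vertex 3.  FIFO order first sets dist[3] = 7, then lowers
# it to 2: the first write was unnecessary.
g = {0: [(1, 4), (2, 1)], 1: [(3, 3)], 2: [(3, 1)], 3: []}
dist, updates = sssp_unordered(g, 0)
```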
Adding Task Priorities and Using a Priority Scheduler
Each task becomes associated with a priority
Tasks are sorted by priorities in the scheduler
For SSSP, priority = distance, processed in ascending order
1 The vertex with distance 1 is scheduled
2 A neighbor's distance drops from ∞ to 2
[Figure: SSSP example graph next to the priority scheduler; processing in ascending distance order updates the ∞ vertex directly to 2.]
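The same example under a strict priority scheduler can be sketched with a binary heap standing in for the CPS (priority = tentative distance, ascending; stale tasks are skipped):

```python
import heapq

def sssp_priority(graph, source):
    """SSSP with a strict priority scheduler sketched as a binary heap:
    tasks come out in ascending distance order (Dijkstra-style), so a
    settled vertex is never updated again."""
    dist = {v: float("inf") for v in graph}
    dist[source] = 0
    heap = [(0, source)]              # priority = tentative distance
    updates = 0
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:               # stale task: a better update won
            continue
        for v, w in graph[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                updates += 1
                heapq.heappush(heap, (dist[v], v))
    return dist, updates

# Same graph shape as the slide's example: each reached vertex is now
# written exactly once, and the wasted write of 7 never happens.
g = {0: [(1, 4), (2, 1)], 1: [(3, 3)], 2: [(3, 1)], 3: []}
dist, updates = sssp_priority(g, 0)
```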
Fundamental Trade-off in Concurrent Priority Schedulers: Synchronization vs. Useless Work
Scheduling in strict priority order causes synchronization overhead
Solution: Use Relaxed Concurrent Priority Scheduler (CPS)
Highly relaxed priority processing → more useless work
Contributions
Provide an extensive empirical analysis of existing CPS designs
Identify shortcomings of high performance CPS designs
Propose a CPS approach (PMOD) to provide robust performance automatically
Types of Concurrent Priority Schedulers
Combined-Priority Queue CPSs: each queue holds multiple priorities
SprayList (SL) – Global Queue [Alistarh et al.]
MultiQueues (MQ) – Distributed Queues [Rihani et al.]
Remote Enqueue Local Dequeue (RELD) – Distributed Queues [Jeffrey et al.]
Per-Priority Queue CPSs: each priority in a different queue
Ordered-By-Integer Metric (obim) [Nguyen et al.]
SprayList
A global queue design
Tasks are stored in a lock-free skip list, sorted in priority order
To minimize synchronization, dequeues retrieve a random high-priority task
Done with a short random walk
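A sequential sketch of the spray idea (the real SprayList is a lock-free skip list and ties its spray width to the thread count; the heap, the fixed spray_width, and the class name here are illustrative):

```python
import heapq
import random

class SprayLike:
    """Sequential sketch of SprayList's relaxed dequeue.  A heap stands
    in for the lock-free skip list.  Dequeue removes a random element
    from among the spray_width highest-priority tasks instead of always
    taking the minimum, so concurrent threads would rarely collide."""
    def __init__(self, spray_width=8, seed=None):
        self.heap = []
        self.width = spray_width      # the paper ties this to thread count
        self.rng = random.Random(seed)

    def enqueue(self, prio, task):
        heapq.heappush(self.heap, (prio, task))

    def dequeue(self):
        k = min(len(self.heap), self.width)
        top = [heapq.heappop(self.heap) for _ in range(k)]  # k smallest
        pick = top.pop(self.rng.randrange(k))               # spray: random one
        for item in top:                                    # rest go back
            heapq.heappush(self.heap, item)
        return pick

s = SprayLike(spray_width=8, seed=1)
for i in range(100):
    s.enqueue(i, f"task{i}")
prio, task = s.dequeue()   # some priority in 0..7, not necessarily 0
```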
Distributed Queues
MultiQueues
Create k queues per core (Total: k × p)
Enqueue to a random queue
Dequeues retrieve the highest-priority task from two random queues
[Diagram: cores c0–c3 and queues q0–q7; enqueues go to random queues, and a dequeue compares two random queues.]
Remote Enqueue Local Dequeue
One queue per core (Total: p)
Enqueue to a random queue
Dequeues go to the local queue
[Diagram: cores c0–c3, each with its own queue q0–q3; enqueues go to random queues, dequeues take from the local queue.]
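The distributed-queue idea can be sketched sequentially (per-queue locks omitted; k and the fallback scan are illustrative). This shows MultiQueues' best-of-two dequeue; RELD would instead always dequeue from the caller's own queue.

```python
import heapq
import random

class MultiQueue:
    """Sketch of MultiQueues: k heaps per core, enqueue to a random
    heap, dequeue from the better of two randomly chosen heaps.
    Sequential; the real design protects each queue with a lock."""
    def __init__(self, num_cores, k=4, seed=None):
        self.queues = [[] for _ in range(k * num_cores)]
        self.rng = random.Random(seed)

    def enqueue(self, prio, task):
        heapq.heappush(self.rng.choice(self.queues), (prio, task))

    def dequeue(self):
        a, b = self.rng.sample(self.queues, 2)      # two random queues
        candidates = [q for q in (a, b) if q]
        if not candidates:                          # both empty: scan all
            candidates = [q for q in self.queues if q]
            if not candidates:
                return None
        best = min(candidates, key=lambda q: q[0])  # smaller head wins
        return heapq.heappop(best)

mq = MultiQueue(num_cores=4, seed=2)
for i in range(50):
    mq.enqueue(i, f"task{i}")
drained = [mq.dequeue()[0] for _ in range(50)]  # roughly, not exactly, sorted
```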
OBIM Chunking
[Diagram: each core c0–c3 holds a local map caching the global map, which maps priorities 2, 5, and 7 to per-priority queues q2, q5, and q7.]
Creates a queue for each priority
Uses a global map for ordering, which is an access bottleneck
A local map caches a snapshot of the global map in each core
When needed, local maps are refreshed from the global map
To reduce contention, a core enqueues and dequeues a chunk of tasks at a time
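A sequential sketch of the per-priority-queue-plus-chunking layout (the per-core local-map caching is omitted; the names and chunk layout are illustrative, not obim's actual code):

```python
from collections import defaultdict

class ObimLike:
    """Sketch of obim's layout: one queue per priority, reached through
    a map, with tasks moved in chunks so a full chunk costs a single
    map/queue access.  Only the global map is modeled here."""
    def __init__(self, chunk_size=64):
        self.global_map = defaultdict(list)   # priority -> list of chunks
        self.chunk_size = chunk_size
        self.out_chunk, self.out_prio = [], None

    def enqueue(self, prio, task):
        if self.out_prio is not None and prio != self.out_prio:
            self.flush()                      # a chunk holds one priority
        self.out_prio = prio
        self.out_chunk.append(task)
        if len(self.out_chunk) >= self.chunk_size:
            self.flush()

    def flush(self):
        if self.out_chunk:
            self.global_map[self.out_prio].append(self.out_chunk)
            self.out_chunk, self.out_prio = [], None

    def dequeue_chunk(self):
        """Return (priority, chunk) for the best nonempty priority."""
        self.flush()
        if not self.global_map:
            return None
        best = min(self.global_map)           # smallest value = highest priority
        chunk = self.global_map[best].pop()
        if not self.global_map[best]:
            del self.global_map[best]
        return best, chunk

ob = ObimLike(chunk_size=2)
for prio, task in [(2, "a"), (2, "b"), (1, "c")]:
    ob.enqueue(prio, task)
first = ob.dequeue_chunk()    # the priority-1 chunk comes out first
```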
Minimizing Useless Work vs. Minimizing Communication
Minimize Useless Work
Combined-priority queues
SL and MQ relaxation guarantees
Retrieve tasks close to priority order
Invest in synchronization and communication
Higher enqueue and dequeue cost
Minimize Communication
Per-priority queues: obim
Emphasis on reducing enqueue/dequeue cost
Coarse-grain enqueue/dequeue operations
Prone to useless work
CPS Performance Comparison
How well do relaxed CPSs work in practice?
Useless work behavior
Performance of CPS operations
Types of Execution Cycles in an Application
Work Cycles
Good Work (GWork): The work ends up being useful.
Useless Work (UWork): The work later proves useless.
CPS Operation Cycles
Enqueue (Enq)
Dequeue (Deq)
Failed Dequeue (FDeq): Attempting and failing to dequeue
Other
Other: Executing framework code.
Analysis Setup
Galois framework
5 applications
Single-source shortest path (SSSP)
Breadth-first search (BFS)
PageRank (PR)
Minimum spanning tree (MST)
A* (A*)
5 input graphs

Graph                  # Verts  # Edges
USA roads (rUSA)       24 M     58 M
West USA roads (rW)    6 M      15 M
Twitter40 (tw)         42 M     1469 M
Web-Google (wg)        875 K    5 M
Soc-LiveJournal1 (lj)  5 M      69 M
Speedups: Combined-Priority Queues vs. Per-Priority Queues
[Plots: speedup vs. number of threads (1–40) for obim, RELD, MQ, and SL on SSSP with tw, BFS with rUSA, and PR with wg.]
Per-priority queue CPSs perform better than combined-priority queue CPSs
Cycles: Combined-Priority Queues vs. Per-Priority Queue CPSs
[Plots: cycle breakdown (%) for obim, RELD, MQ, and SL on SSSP tw, BFS rUSA, and PR wg; categories: Other, FDeq, Deq, Enq, GWork, UWork.]
Small amount of useless work
Combined-priority queues spend up to 90% of cycles on CPS operations
Per-Priority Queues: Effect of Chunking
[Plot: speedup vs. number of threads (1–40) for BFS with rUSA; c0 ⇒ no chunks, c16 ⇒ chunk size 16, c64 ⇒ chunk size 64, c256 ⇒ chunk size 256.]
Enqueueing and dequeueing a chunk of tasks is very beneficial
Chunking decreases the Deq time.
No chunks: Higher Deq
Large chunk sizes: Higher FDeq
Per-Priority Queues: Effect of Priority Distribution
Different algorithms create different priority distributions
BFS: Close-by priorities
SSSP: Dispersed priorities
Dispersed priorities lower the performance of per-priority queue CPSs
They create many per-priority queues, each holding few tasks
Ad-Hoc Solution: Manually Merging Priorities
Different priorities are grouped together in the same per-priority queue
Merging Factor (∆):
Range of priorities that map to the same queue
The best ∆ parameter is input-dependent
Used in the SSSP application:
Optimal ∆ for SSSP with rUSA is ∆ = 14 ⇒ range = [0, 2^14)
Optimal ∆ for SSSP with tw is ∆ = 0 ⇒ no priority merging
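A shift-based mapping consistent with the slide's range [0, 2^14) can be sketched as follows (the exact mapping inside obim/PMOD may differ; this is an illustrative reading):

```python
def merged_priority(prio, delta):
    """Map a task priority to a queue index under merging factor delta:
    2**delta consecutive priorities share one queue, so delta = 14
    sends all priorities in [0, 2**14) to queue 0, while delta = 0
    leaves every priority in its own queue (no merging)."""
    return prio >> delta

q_low = merged_priority(0, 14)            # queue 0
q_high = merged_priority(2**14 - 1, 14)   # still queue 0: same queue
q_next = merged_priority(2**14, 14)       # queue 1: next range starts
q_exact = merged_priority(7, 0)           # queue 7: no merging
```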
Effect of Merging Factor for SSSP
[Plot: speedup vs. number of threads (1–40) for SSSP with rUSA; ∆10 ⇒ merging factor 10, ∆14 ⇒ merging factor 14, ∆18 ⇒ merging factor 18.]
Priority merging has significant impact
Too-small ∆: Dequeue has high overhead
Too-large ∆: High useless work
Optimal ∆: 20–35% of cycles spent on useless work
Proposed CPS: Priority Merging on Demand (PMOD)
Too many per-priority queues → higher enqueue and dequeue times
Too few per-priority queues → more useless work
Idea: Dynamically detect inefficient execution and tune Merging Factor
Too many per-priority queues → Merge priorities (increase ∆)
Too few per-priority queues → Unmerge priorities (decrease ∆)
Detect and Adjust Merging Factor
Too many per-priority queues → Merge priorities (increase ∆)
1 Trigger: Frequent global/local map synchronizations
2 Check: Density of per-priority queues
3 Increase merging factor
Too few per-priority queues → Unmerge priorities (decrease ∆)
1 Trigger: Too many tasks dequeued from a per-priority queue
2 Check: Number of per-priority queues and priority range
3 Decrease merging factor
PMOD challenge: Priority ordering with different merging factors
Details in the paper...
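The two detect-and-adjust rules above can be sketched as a single decision function. All thresholds and parameter names here are hypothetical: the slide describes the triggers only qualitatively, so this is an illustration of the control loop's shape, not PMOD's actual policy.

```python
def adjust_delta(delta, map_syncs, tasks_per_queue, num_queues,
                 sync_limit=100, dense_limit=4,
                 burst_limit=1000, few_queues=8):
    """Hypothetical sketch of PMOD's tuning decision; all thresholds
    are illustrative, not taken from the paper."""
    # Trigger 1: frequent map synchronizations while queues are sparse
    #            -> too many per-priority queues: merge (increase delta)
    if map_syncs > sync_limit and tasks_per_queue < dense_limit:
        return delta + 1
    # Trigger 2: long dequeue bursts from one queue while queues are few
    #            -> too few per-priority queues: unmerge (decrease delta)
    if tasks_per_queue > burst_limit and num_queues < few_queues and delta > 0:
        return delta - 1
    return delta

merged = adjust_delta(10, map_syncs=500, tasks_per_queue=1, num_queues=100)   # 11
unmerged = adjust_delta(10, map_syncs=0, tasks_per_queue=5000, num_queues=2)  # 9
steady = adjust_delta(10, map_syncs=0, tasks_per_queue=10, num_queues=50)     # 10
```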
PMOD Performance
[Plots: speedup vs. number of threads (1–40) for obim and PMOD on SSSP with rUSA and SSSP with tw.]
Same performance as obim without manual, application-specific tuning
The merging factor converges to 13 for SSSP with rUSA (optimal ∆ = 14)
PMOD Performance
[Plots: speedup vs. number of threads (1–40) for obim and PMOD on BFS with rUSA and BFS with tw.]
Same performance as obim without manual, application-specific tuning
The merging factor does not increase when it is not needed: BFS
Conclusions
Fundamental trade-off between useless work and synchronization overhead
Current CPSs often incur large enqueue/dequeue overheads
Per-priority CPSs that minimize CPS operation cost give better performance
But performance is application & input dependent
PMOD provides high performance automatically
Same performance as obim manually tuned per application
In the paper
Performance sensitivity to chunk sizes & priority merging
obim performance sensitivity to priority ranges
More applications and insights