
If you were plowing a field, which would you rather use?


Page 1: If you were plowing a field,  which would you rather use?


If you were plowing a field, which would you rather use?
Two oxen, or 1024 chickens? (Attributed to S. Cray)

Abdullah Gharaibeh, Lauro Costa, Elizeu Santos-Neto and Matei Ripeanu
Netsyslab, The University of British Columbia

Page 2: If you were plowing a field,  which would you rather use?


The ‘Field’ to Plow: Graph Processing

• The web: 1.4B pages, 6.6B links
• Social networks: 1B users, 150B friendships
• The human brain: 100B neurons, 700T connections

Page 3: If you were plowing a field,  which would you rather use?


Graph Processing Challenges

• Poor locality
• Data-dependent memory access patterns
• Low compute-to-memory access ratio
• Large memory footprint (>256GB)
• Varying degrees of parallelism (both intra- and inter-stage)

CPUs: caches + summary data structures; large main memory

Page 4: If you were plowing a field,  which would you rather use?


The GPU Opportunity

• Poor locality
• Data-dependent memory access patterns
• Low compute-to-memory access ratio
• Large memory footprint (>256GB)
• Varying degrees of parallelism (both intra- and inter-stage)

CPUs: caches + summary data structures; large main memory (>256GB)
GPUs: massive hardware multithreading; caches + summary data structures; limited memory (6GB!)

=> Assemble a hybrid platform

Page 5: If you were plowing a field,  which would you rather use?


Motivating Question
Can we efficiently use hybrid systems for large-scale graph processing?

YES WE CAN! 2x speedup (4 billion edges)

Page 6: If you were plowing a field,  which would you rather use?


Methodology
• Performance Model: predicts speedup; intuitive
• Totem: a graph processing engine for hybrid systems; applies algorithm-agnostic optimizations
• Partitioning Strategies: can one do better than random?
• Evaluation: partitioning strategies; Hybrid vs. Symmetric

A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph Processing, PACT 2012
On Graphs, GPUs, and Blind Dating: A Workload to Processor Matchmaking Quest, IPDPS 2013

Page 7: If you were plowing a field,  which would you rather use?


Totem: Compute Model

• Bulk Synchronous Parallel: rounds of computation and communication phases
• Updates to remote vertices are delivered in the next round
• Partitions vote to terminate execution
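To make the round structure above concrete, here is a minimal C sketch of such a BSP driver loop; the partition_t type and its callbacks are illustrative assumptions, not Totem's actual API.

/* Minimal sketch of a BSP driver loop for the compute model above.
   The partition_t type and its callbacks are illustrative placeholders,
   not Totem's actual API. */
#include <stdbool.h>
#include <stddef.h>

typedef struct partition {
  bool finished;                              /* partition's vote to terminate          */
  void (*compute)(struct partition *p);       /* user-defined per-partition kernel      */
  void (*communicate)(struct partition *p);   /* flush buffered updates to remote parts */
} partition_t;

/* Runs rounds of a computation phase followed by a communication phase.
   Because updates are only exchanged in the communication phase, a write
   to a remote vertex in round r becomes visible in round r + 1. Execution
   stops once every partition has voted to terminate. */
static void bsp_execute(partition_t *parts, size_t count) {
  bool all_done = false;
  while (!all_done) {
    for (size_t i = 0; i < count; i++)        /* computation phase   */
      parts[i].compute(&parts[i]);
    for (size_t i = 0; i < count; i++)        /* communication phase */
      parts[i].communicate(&parts[i]);
    all_done = true;                          /* tally termination votes */
    for (size_t i = 0; i < count; i++)
      all_done = all_done && parts[i].finished;
  }
}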

Page 8: If you were plowing a field,  which would you rather use?


Totem: Programming Model
• A user-defined kernel runs simultaneously on each partition
• Vertices are processed in parallel within each partition
• Messages to remote vertices are combined if a combiner function (msg_combine_func) is defined
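As a rough illustration of this model, the callback shapes might look like the following C sketch; the type names and signatures are assumptions for illustration (only the idea of a per-partition kernel and the msg_combine_func combiner comes from the slide).

/* Sketch of the programming-model hooks: a per-partition kernel plus an
   optional message combiner. Signatures are assumptions, not Totem's API. */
#include <stdint.h>

typedef uint32_t msg_t;   /* payload of a message sent to a remote vertex */

/* User-defined kernel: invoked once per round on each partition; inside it,
   the partition's vertices are processed in parallel (e.g., by CPU threads
   on a CPU partition or GPU threads on a GPU partition). */
typedef void (*kernel_func_t)(void *partition_state);

/* Optional combiner: if defined, messages destined to the same remote
   vertex are folded into one before being transferred. */
typedef msg_t (*msg_combine_func_t)(msg_t a, msg_t b);

/* Example combiner a BFS/SSSP-style algorithm might register: keep the
   smaller of two candidate levels/distances. */
static msg_t msg_combine_min(msg_t a, msg_t b) {
  return a < b ? a : b;
}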

Page 9: If you were plowing a field,  which would you rather use?


Target Platform


                      Intel Nehalem        Fermi GPU
                      Xeon X5650           Tesla C2075
                      (2x sockets)         (2x GPUs)
  Core Frequency      2.67GHz              1.15GHz
  Num HW-threads      12                   448
  Last Level Cache    12MB                 2MB
  Main Memory         144GB                6GB
  Memory Bandwidth    32GB/sec             144GB/sec

Page 10: If you were plowing a field,  which would you rather use?


Use Case: Breadth-first Search
• Graph traversal algorithm
• Very little computation per vertex
• Sensitive to cache utilization

[Figure: example BFS traversal showing the Level 0 and Level 1 frontiers and the corresponding visited bit-vector]
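To spell out the traversal pattern (little work per vertex, a visited bit-vector to track reached vertices), here is a plain sequential C sketch of level-synchronous BFS over a CSR graph; the data layout and names are assumptions, not the hybrid implementation evaluated in the talk.

/* Level-synchronous BFS over a CSR graph using a visited bit-vector.
   The CSR layout and identifiers are illustrative assumptions. */
#include <stdint.h>
#include <string.h>

typedef struct {
  uint32_t  num_vertices;
  uint32_t *row_offsets;   /* length num_vertices + 1 */
  uint32_t *columns;       /* edge destinations       */
} graph_t;

#define BIT_SET(bv, v)  ((bv)[(v) >> 5] |= (1u << ((v) & 31)))
#define BIT_GET(bv, v)  ((bv)[(v) >> 5] &  (1u << ((v) & 31)))

/* Writes the BFS level of each vertex into levels[]; unreached = 0xFFFFFFFF.
   visited must hold at least ceil(num_vertices / 32) words. */
void bfs(const graph_t *g, uint32_t source, uint32_t *levels, uint32_t *visited) {
  memset(levels, 0xff, g->num_vertices * sizeof(uint32_t));
  memset(visited, 0, ((g->num_vertices + 31) / 32) * sizeof(uint32_t));
  levels[source] = 0;
  BIT_SET(visited, source);
  uint32_t level = 0;
  int expanded = 1;
  while (expanded) {                      /* one iteration per BFS level */
    expanded = 0;
    for (uint32_t v = 0; v < g->num_vertices; v++) {
      if (levels[v] != level) continue;   /* v is in the current frontier */
      for (uint32_t e = g->row_offsets[v]; e < g->row_offsets[v + 1]; e++) {
        uint32_t nbr = g->columns[e];
        if (!BIT_GET(visited, nbr)) {     /* very little work per vertex  */
          BIT_SET(visited, nbr);
          levels[nbr] = level + 1;
          expanded = 1;
        }
      }
    }
    level++;
  }
}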

Page 11: If you were plowing a field,  which would you rather use?


BFS Performance

Workload: scale-28 Graph500 benchmark (|V|=256m, |E|=4B)

1S1G = 1 CPU Socket + 1 GPU

Limited GPU memory prevents offloading more

Linear improvement with respect to offloaded part

Can we do better than random partitioning?

Page 12: If you were plowing a field,  which would you rather use?


BFS Performance

Workload: scale-28 Graph500 benchmark (|V|=256m, |E|=4B)

1S1G = 1 CPU Socket + 1 GPU

Computation phase dominates run time

Page 13: If you were plowing a field,  which would you rather use?


Workload-Processor Matchmaking

Observation: real-world graphs have a power-law edge degree distribution.

Idea: place the many low-degree vertices on the GPU and the few high-degree vertices on the CPU.

[Figure: log-log plot of the edge degree distribution for the LiveJournal social network, showing many low-degree vertices and few high-degree vertices]
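A minimal sketch of that placement idea (the slide's HIGH strategy; flipping the comparison gives LOW): assign vertices by an edge-degree threshold. The threshold value and identifiers are assumptions, and in practice the GPU share is additionally bounded by its limited memory.

/* Degree-threshold placement: the few high-degree vertices go to the CPU,
   the many low-degree vertices to the GPU ("HIGH" strategy; inverting the
   test gives "LOW"). Threshold and names are illustrative assumptions; the
   GPU share is also capped by its limited memory (6GB here). */
#include <stdint.h>

enum processor { CPU = 0, GPU = 1 };

void partition_by_degree(const uint32_t *degree,   /* edge degree of each vertex */
                         uint8_t *placement,       /* out: processor per vertex  */
                         uint32_t num_vertices,
                         uint32_t high_degree_threshold) {
  for (uint32_t v = 0; v < num_vertices; v++)
    placement[v] = (degree[v] >= high_degree_threshold) ? CPU : GPU;
}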

Page 14: If you were plowing a field,  which would you rather use?


BFS Performance

Workload: scale-28 Graph500 benchmark (|V|=256m, |E|=4B)

1S1G = 1 CPU Socket + 1 GPU

2x performance gain!

LOW: No gain over random!

Exploring the 75% data point

Specialization leads to superlinear speedup

Page 15: If you were plowing a field,  which would you rather use?

BFS Performance – More Details

Host is the bottleneck in all cases!

HIGH leads to better load balancing

Page 16: If you were plowing a field,  which would you rather use?


PageRank: A More Compute-Intensive Use Case

Workload: scale-28 Graph500 benchmark (|V|=256m, |E|=4B)
1S1G = 1 CPU Socket + 1 GPU

LOW: minimal gain over random! Better packing

25% performance gain!
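For context on why PageRank stresses the processors differently than BFS, here is a sketch of one pull-style PageRank round: it does floating-point work on every edge in every round, hence the higher compute intensity. The formulation, damping constant, and names are assumptions, not necessarily the variant used in the talk.

/* One round of a pull-style PageRank update, sketched to show the extra
   per-edge floating-point work compared with BFS. Layout and constants are
   illustrative assumptions. */
#include <stdint.h>

#define DAMPING 0.85

typedef struct {
  uint32_t  num_vertices;
  uint32_t *in_offsets;    /* CSR over incoming edges: length num_vertices + 1 */
  uint32_t *in_neighbors;  /* source vertex of each incoming edge              */
  uint32_t *out_degree;    /* out-degree of every vertex                       */
} pr_graph_t;

void pagerank_round(const pr_graph_t *g, const double *rank, double *next_rank) {
  for (uint32_t v = 0; v < g->num_vertices; v++) {
    double sum = 0.0;
    for (uint32_t e = g->in_offsets[v]; e < g->in_offsets[v + 1]; e++) {
      uint32_t src = g->in_neighbors[e];
      sum += rank[src] / g->out_degree[src];          /* gather from in-neighbors */
    }
    next_rank[v] = (1.0 - DAMPING) / g->num_vertices + DAMPING * sum;
  }
}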

Page 17: If you were plowing a field,  which would you rather use?


Same Analysis, Smaller Graph

(scale-25 RMAT graphs: |V|=32m, |E|=512m)


[Figures: BFS and PageRank performance on the smaller graph]

• Intelligent partitioning provides benefits
• Key for performance: load balancing

Page 18: If you were plowing a field,  which would you rather use?


Worst Case: Uniform Graph (not scale-free)


BFS on scale-25 uniform graph (|V|=32m, |E|=512m)

Hybrid techniques not useful for uniform graphs

Page 19: If you were plowing a field,  which would you rather use?


Scalability: BFS

S = CPU Socket, G = GPU

Hybrid offers ~2x speedup vs Symmetric

GPU memory is limited, GPU share < 1% of the edges!

Comparable to top 100 in Graph500 challenge!

Page 20: If you were plowing a field,  which would you rather use?


Scalability: PageRank


Hybrid offers better gains for more compute-intensive algorithms

S = CPU Socket, G = GPU

Page 21: If you were plowing a field,  which would you rather use?


Summary

• Problem: large-scale graph processing is challenging
  – Heterogeneous workload – power-law degree distribution
• Solution: Totem Framework
  – Harnesses hybrid systems – utilizes the strengths of CPUs and GPUs
  – Powerful abstraction – simplifies implementing algorithms for hybrid systems
  – Partitioning – workload to processor matchmaking
  – Algorithms – BFS, SSSP, PageRank, Centrality Measures

Page 22: If you were plowing a field,  which would you rather use?


Questions

code@: netsyslab.ece.ubc.ca