A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing

Junwhan Ahn, Sungpack Hong*, Sungjoo Yoo, Onur Mutlu+, Kiyoung Choi

Seoul National University, *Oracle Labs, +Carnegie Mellon University


Graphs

• Abstract representation of object relationships
– Vertex: object (e.g., person, article, …)
– Edge: relationship (e.g., friendships, hyperlinks, …)

• Recent trend: explosive increase in graph size
– 36 million Wikipedia pages
– 1.4 billion Facebook users
– 300 million Twitter users
– 30 billion Instagram photos


Large-Scale Graph Processing

• Example: Google’s PageRank

$R[i] = \alpha + \sum_{j \in \mathrm{Succ}(i)} w_{ji} R[j]$

    for (v: graph.vertices) {
        for (w: v.successors) {
            w.next_rank += weight * v.rank;
        }
    }
    for (v: graph.vertices) {
        v.rank = v.next_rank; v.next_rank = alpha;
    }


• The computation is independent for each vertex: vertex-parallel abstraction
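The two PageRank loops can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' code: the graph layout (a dict of successor lists), the uniform `weight`, and the name `pagerank_step` are all hypothetical.

```python
def pagerank_step(successors, rank, alpha, weight):
    """One PageRank iteration following the two loops on the slide.

    successors: dict mapping each vertex to a list of its successors
                (assumed layout; every vertex must appear as a key).
    rank:       dict mapping each vertex to its current rank.
    alpha, weight: constants, assumed uniform for all vertices and edges.
    """
    # next_rank starts at alpha, matching `v.next_rank = alpha` on the slide.
    next_rank = {v: alpha for v in successors}
    # First loop: each vertex pushes weight * rank to every successor.
    # The outer iterations touch different source vertices, which is what
    # makes the loop vertex-parallel.
    for v, succ in successors.items():
        for w in succ:
            next_rank[w] += weight * rank[v]
    # Second loop: the accumulated values become the new ranks.
    return next_rank


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {v: 1.0 for v in graph}
ranks = pagerank_step(graph, ranks, alpha=0.15, weight=0.5)
```

In a real parallel run, the `+=` on `next_rank[w]` would need to be atomic, because several source vertices can share a successor.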


PageRank Performance

[Figure: PageRank speedup. Going from 32 Cores (DDR3) to 128 Cores (DDR3) yields only +42%.]


Bottleneck of Graph Processing

    for (v: graph.vertices) {
        for (w: v.successors) {
            w.next_rank += weight * v.rank;
        }
    }

1. Frequent random memory accesses: following &w from v to reach w.rank, w.next_rank, and w.edges
2. Little amount of computation: weight * v.rank


High Memory Bandwidth Demand


PageRank Performance

[Figure: PageRank speedup by system and memory bandwidth]
– 32 Cores, DDR3 (102.4GB/s): baseline
– 128 Cores, DDR3 (102.4GB/s): +42%
– 128 Cores, HMC (640GB/s): +89%
– 128 Cores, HMC Internal BW (8TB/s): 5.3x

Lessons Learned:
1. High memory bandwidth is the key to the scalability of graph processing
2. Conventional systems do not fully utilize high memory bandwidth
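The second lesson can be checked with a little arithmetic on the figures above. The sketch below assumes that both percentage speedups are measured against the same 32-core DDR3 baseline; the slide does not state this explicitly, and the variable names are illustrative.

```python
# Figures taken from the slide. Reading both speedups as relative to the
# 32-core DDR3 system is an assumption.
ddr3_bw_gbs = 102.4      # 128 cores with DDR3
hmc_bw_gbs = 640.0       # 128 cores with HMC

speedup_ddr3_128 = 1.42  # "+42%"
speedup_hmc_128 = 1.89   # "+89%"

# How much more bandwidth HMC offers, and how much faster it actually runs.
bw_ratio = hmc_bw_gbs / ddr3_bw_gbs
gain_from_hmc = speedup_hmc_128 / speedup_ddr3_128

print(f"bandwidth: {bw_ratio:.2f}x, performance: {gain_from_hmc:.2f}x")
```

Under that assumption, the performance gain from switching to HMC is far smaller than the increase in available bandwidth, which is the gap the slide attributes to conventional systems.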


Challenges in Scalable Graph Processing

• Challenge 1: How to provide high memory bandwidth to computation units in a practical way?
– Processing-in-memory based on 3D-stacked DRAM

• Challenge 2: How to design computation units that efficiently exploit large memory bandwidth?
– Specialized in-order cores called Tesseract cores
  • Latency-tolerant programming model
  • Graph-processing-specific prefetching schemes


Tesseract System

[Figure: Tesseract system architecture. Labeled components: Host Processor; Memory-Mapped Accelerator Interface (Noncacheable, Physically Addressed); Crossbar Network; DRAM Controller; NI; In-Order Core; Message Queue; PF Buffer; MTP; LP]


Tesseract System: communications via remote function calls (Message Queue)


Communications in Tesseract

    for (v: graph.vertices) {
        for (w: v.successors) {
            w.next_rank += weight * v.rank;
        }
    }

[Figure: vertex v holds a pointer &w to its successor w]


[Figure: v and its pointer &w reside in Vault #1; the successor w resides in Vault #2]


    for (v: graph.vertices) {
        for (w: v.successors) {
            put(w.id, function() { w.next_rank += weight * v.rank; });
        }
    }
    barrier();

• Non-blocking remote function call: each put can be delayed until the nearest barrier

[Figure: Vault #1 (holding v and &w) sends put messages to Vault #2 (holding w)]
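The put/barrier programming model can be mimicked in ordinary Python. The sketch below is a software analogy only: the `Vault` class, the modulo placement in `owner_of`, and the graph data are assumptions, and the real hardware interface is not shown on the slide.

```python
from collections import defaultdict


class Vault:
    """A hypothetical vault: owns some vertices and a list of pending calls."""

    def __init__(self):
        self.next_rank = defaultdict(float)
        self.pending = []  # remote function calls not yet executed

    def run_pending(self):
        for func in self.pending:
            func()
        self.pending.clear()


def put(vaults, owner_of, w, func):
    """Non-blocking: enqueue func at the vault owning w and return at once."""
    vaults[owner_of(w)].pending.append(func)


def barrier(vaults):
    """Every call issued before the barrier has run once this returns."""
    for vault in vaults:
        vault.run_pending()


NUM_VAULTS = 2
vaults = [Vault() for _ in range(NUM_VAULTS)]


def owner_of(vertex):
    return vertex % NUM_VAULTS  # assumed vertex placement


successors = {0: [1, 2], 1: [2], 2: [0]}
rank = {0: 1.0, 1: 1.0, 2: 1.0}
weight = 0.5

for v, succ in successors.items():
    for w in succ:
        value = weight * rank[v]

        # Bind w and value now; the body runs later at the owning vault.
        def update(w=w, value=value):
            vaults[owner_of(w)].next_rank[w] += value

        put(vaults, owner_of, w, update)

pending_before = sum(len(vault.pending) for vault in vaults)
barrier(vaults)
```

Because each update runs only at the vault that owns `w`, the sender never reads or writes remote data itself; it only ships the function and its argument.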


Non-blocking Remote Function Call

    put(w.id, function() { w.next_rank += value; })

[Figure: Local Core → NI → message (&func, &w, value) → NI → MQ → Remote Core]

1. Send function address & args to the remote core
2. Store the incoming message to the message queue
3. Flush the message queue when it is full or a synchronization barrier is reached
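The three steps can be modeled with a bounded queue on the receiving side. This is a hedged sketch: `RemoteCore`, `mq_capacity`, and `add_to_next_rank` are hypothetical names, and the model ignores the network and timing entirely.

```python
class RemoteCore:
    """Hypothetical model of the receiving core and its message queue (MQ)."""

    def __init__(self, mq_capacity):
        self.mq_capacity = mq_capacity
        self.mq = []          # queued (func, target, value) messages
        self.next_rank = {}   # data local to this core
        self.flushes = 0      # how many times the queue was drained

    def receive(self, func, target, value):
        # Step 2: store the incoming message in the message queue.
        self.mq.append((func, target, value))
        # Step 3 (first case): flush when the queue is full.
        if len(self.mq) >= self.mq_capacity:
            self.flush()

    def flush(self):
        # Execute every queued function call, then empty the queue.
        for func, target, value in self.mq:
            func(self, target, value)
        self.mq.clear()
        self.flushes += 1

    def barrier(self):
        # Step 3 (second case): flush what is left when a barrier is reached.
        if self.mq:
            self.flush()


def add_to_next_rank(core, target, value):
    core.next_rank[target] = core.next_rank.get(target, 0.0) + value


remote = RemoteCore(mq_capacity=4)
for _ in range(6):
    # Step 1: the local core sends the function, the target, and the argument.
    remote.receive(add_to_next_rank, "w", 0.25)
queued_before_barrier = len(remote.mq)
remote.barrier()
```

Only the remote core ever modifies `next_rank`, and it does so one message at a time, which is why no lock appears anywhere in the sketch.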


Benefits of Non-blocking Remote Function Call

• Latency hiding through fire-and-forget
– Local cores are not blocked by remote function calls

• Localized memory traffic
– No off-chip traffic during remote function call execution

• No need for mutexes
– Non-blocking remote function calls are atomic

• Prefetching
– Will be covered shortly


Tesseract System: prefetching components (PF Buffer, MTP, LP)


Memory Access Patterns in Graph Processing

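    for (v: graph.vertices) {
        for (w: v.successors) {
            put(w.id, function() { w.next_rank += weight * v.rank; });
        }
    }

[Figure: v and &w in Vault #1, w in Vault #2]
• Sequential (local): traversing v and its successor pointers &w in Vault #1
• Random (remote): updating w in Vault #2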


Message-Triggered Prefetching

• Prefetching random memory accesses is difficult

• Opportunities in Tesseract
– Domain-specific knowledge of target data address
– Time slack of non-blocking remote function calls: each put can be delayed until the nearest barrier


    put(w.id, function() { w.next_rank += value; }, &w.next_rank)

The extra third argument, &w.next_rank, tells the receiver which address to prefetch.

1. Message M1 received
2. Request prefetch
3. Mark M1 as ready when the prefetch is serviced
4. Process multiple ready messages at once

[Figure: DRAM Controller, NI, In-Order Core, MTP, Message Queue holding messages M1–M4, and PF Buffer holding prefetched data d1–d3]
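The four steps can be followed in a small software model. This is a sketch under assumptions, not the hardware design: `PrefetchingCore`, its method names, and the write-back at the end of `process_ready` are hypothetical, and prefetches are serviced in arrival order for simplicity.

```python
from collections import deque


class PrefetchingCore:
    """Hypothetical model of a core with message-triggered prefetching."""

    def __init__(self, memory):
        self.memory = memory     # address -> value, stands in for DRAM
        self.pf_buffer = {}      # prefetched copies, address -> value
        self.waiting = deque()   # messages whose prefetch is in flight
        self.ready = []          # messages whose data has arrived

    def receive(self, func, value, prefetch_addr):
        # Steps 1-2: enqueue the message and request a prefetch for its data.
        self.waiting.append((func, value, prefetch_addr))

    def service_prefetch(self):
        # Step 3: the oldest outstanding prefetch returns; mark it ready.
        func, value, addr = self.waiting.popleft()
        self.pf_buffer[addr] = self.memory[addr]
        self.ready.append((func, value, addr))

    def process_ready(self):
        # Step 4: run every ready message in one batch on buffered data.
        processed = len(self.ready)
        for func, value, addr in self.ready:
            self.pf_buffer[addr] = func(self.pf_buffer[addr], value)
        # Assumed write-back so later prefetches observe the updates.
        for addr in {a for _, _, a in self.ready}:
            self.memory[addr] = self.pf_buffer.pop(addr)
        self.ready.clear()
        return processed


def add(current, value):
    return current + value


memory = {"w1.next_rank": 0.0, "w2.next_rank": 0.0}
core = PrefetchingCore(memory)

core.receive(add, 0.5, "w1.next_rank")   # M1
core.receive(add, 0.25, "w2.next_rank")  # M2
core.receive(add, 0.25, "w1.next_rank")  # M3
core.service_prefetch()                  # M1 becomes ready
core.service_prefetch()                  # M2 becomes ready
batch = core.process_ready()             # M1 and M2 run together
```

The point of the batch is that the core never waits on memory for a ready message: its data is already in the prefetch buffer when the function runs.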


Other Features of Tesseract

• Blocking remote function calls

• List prefetching

• Prefetch buffer

• Programming APIs

• Application mapping

Please see the paper for details


Evaluated Systems

– DDR3-OoO (with FDP): 4 × 8 OoO cores at 4GHz, 102.4GB/s
– HMC-OoO (with FDP): 4 × 8 OoO cores at 4GHz, 640GB/s
– HMC-MC: 4 × 128 in-order cores at 2GHz, 640GB/s
– Tesseract: 32 Tesseract Cores (32-entry MQ, 4KB PF Buffer), 8TB/s


Workloads

• Five graph processing algorithms
– Average teenage follower
– Conductance
– PageRank
– Single-source shortest path
– Vertex cover

• Three real-world large graphs
– ljournal-2008 (social network)
– enwiki-2013 (Wikipedia)
– indochina-2004 (web graph)
– 4–7M vertices, 79–194M edges


Performance

[Figure: speedup over DDR3-OoO]
– DDR3-OoO: baseline
– HMC-OoO: +56%
– HMC-MC: +25%
– Tesseract: 9.0x
– Tesseract-LP: 11.6x
– Tesseract-LP-MTP: 13.8x


Memory Bandwidth Consumption

– DDR3-OoO: 80GB/s
– HMC-OoO: 190GB/s
– HMC-MC: 243GB/s
– Tesseract: 1.3TB/s
– Tesseract-LP: 2.2TB/s
– Tesseract-LP-MTP: 2.9TB/s


Iso-Bandwidth Comparison

[Figure: speedup over HMC-MC; Tesseract results are without prefetching]
– HMC-MC (640GB/s): baseline
– HMC-MC + PIM BW (8TB/s): 2.3x (effect of bandwidth)
– Tesseract + Conventional BW (640GB/s): 3.0x (effect of programming model)
– Tesseract (8TB/s): 6.5x


Execution Time Breakdown

[Figure: execution time breakdown (0% to 100%) for AT.LJ, CT.LJ, PR.LJ, SP.LJ, and VC.LJ. Categories: Normal Mode, Interrupt Mode, Interrupt Switching, Network, Barrier]


• Network backpressure due to message passing (future work: message combiner, …)


• Workload imbalance (future work: load balancing)


Prefetch Efficiency

Prefetch Timeliness

[Figure: speedup of Ideal (prefetch takes zero cycles) relative to Tesseract with Prefetching: 1.04]

Coverage

– Prefetch Buffer Hit: 88%
– Demand L1 Miss: 12%


Scalability

[Figure: speedup as the number of cores grows, for two systems]

DDR3-OoO
– 32 Cores: baseline
– 128 Cores: +42%

Tesseract with Prefetching
– 32 Cores (8GB): baseline
– 128 Cores (32GB): 4x
– 512 Cores (128GB): 12x
– 512 Cores + Partitioning: 13x

• Memory-capacity-proportional performance


Memory Energy Consumption

[Figure: normalized memory energy of HMC-OoO vs. Tesseract with Prefetching, broken down into Memory Layers, Logic Layers, and Cores. Tesseract with Prefetching: -87%]


Conclusion

• Revisiting the PIM concept in a new context
– Cost-effective 3D integration of logic and memory
– Graph processing workloads demanding high memory bandwidth

• Tesseract: scalable PIM for graph processing
– Many in-order cores in a memory chip
– New message passing mechanism for latency hiding
– New hardware prefetchers for graph processing
– Programming interface that exploits our hardware design

• Evaluations demonstrate the benefits of Tesseract
– 14x performance improvement & 87% energy reduction
– Scalable: memory-capacity-proportional performance
