Avinoam Kolodny: Threads, Caches and Networks in Chip-MultiProcessor Systems. Electrical Engineering Department, Technion, Israel Institute of Technology. ETNA: 1st International Workshop on Emerging Topics in NoC-aware Computer Architecture, ISCA 2013

Page 1

Avinoam Kolodny

Threads, Caches and Networks in Chip-MultiProcessor Systems

Electrical Engineering Department

Technion – Israel Institute of Technology

ETNA –

1st International Workshop on Emerging Topics in NoC-aware Computer Architecture

ISCA 2013

Page 2

Technion’s NoC & Architecture Collaborators:

– Israel Cidon

– Yoav Etzion

– Eby Friedman

– Ran Ginosar

– Idit Keidar

– Isaac Keslassy

– Avinoam Kolodny

– Avi Mendelson

– Uri Weiser

– …. And some very good students!

Page 3

Chips are similar to Cities

System complexity is shown in the interconnect

Page 4

If a chip is like a city, a NoC is similar to a subway system

Page 5

Changing view of VLSI systems

“Old” view:

• Communication is fast and free

• Execution time and energy are dominated by ALU operations

“New” view:

• Communication dominates delay, power and cost

• Computing operations are fast and cheap

The truth is actually somewhere in the middle…

… that’s why NoC+CMP architecture is challenging!

Page 6

On-Chip Interconnect is a Bottleneck: The challenge of wire design

Interconnect delay is dominant

Source: Bohr, IEDM 1995

Interconnect power is dominant

• Interconnect: 51%

• Gate: 34%

• Diffusion: 15%

* N. Magen, A. Kolodny, U. Weiser and N. Shamir, SLIP 2004. (Data for Intel “Banias” Centrino processor)

Page 7

Network on-Chip (NoC)

[Figure: grid of computing modules, each attached to a network router; routers are connected by network links]

• Network instead of dedicated wires and buses

• Inherently parallel

• Efficient sharing of wires

• Scalable, cost-effective bandwidth

Page 8

Critical Problems Addressed by NoC

1) Global wire design (delay, power, noise, bandwidth, scalability, reliability issues)

2) System integration productivity problem (key to modular design)

3) Building multi-core systems (key to power-efficient computing)

[Figure: grid of modules connected by a NoC]

Page 9

Processor Evolution

[Figure: single core (CPU and cache), dual core (CPU1, CPU2 and caches), quad core (CPU1 to CPU4 and caches)]

[Pollack]

Page 10

Asymmetric (=Heterogeneous) Multi-Core

• Small cores, each of area a

• Large core of area βa, used for serial code

• Parallel phases execute on all cores

[Figure: one large core of area βa (serial) next to the small cores]
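The trade-off between one large core and many small ones can be explored with a small Amdahl-style calculation. The sketch below is not taken from the slides: it assumes Pollack's rule (core performance grows with the square root of core area), and the function name, the area budget and the parallel fraction are all hypothetical parameters chosen for illustration.

```python
import math


def asymmetric_speedup(parallel_fraction, total_area, beta, small_area=1.0):
    """Speedup over one small core for a chip with one large core and many small ones.

    Assumptions (not from the slides): Pollack's rule, perf ~ sqrt(area);
    the serial phase runs on the large core only; the parallel phase runs
    on the large core plus all small cores.
    """
    large_area = beta * small_area
    if large_area > total_area:
        raise ValueError("large core does not fit in the area budget")
    n_small = int((total_area - large_area) // small_area)
    perf_large = math.sqrt(beta)  # performance relative to one small core
    serial_time = (1.0 - parallel_fraction) / perf_large
    parallel_time = parallel_fraction / (perf_large + n_small)
    return 1.0 / (serial_time + parallel_time)


if __name__ == "__main__":
    for beta in (1, 4, 9):
        print(beta, round(asymmetric_speedup(0.9, 64, beta), 2))
```

Sweeping `beta` for a fixed area budget shows how much serial-phase performance is bought at the cost of fewer small cores for the parallel phase.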

Page 11

What do we know about future systems?

[Figure: large grid of NoC routers (R), annotated on a scale from “High certainty” to “Totally unknown”. Labels in extraction order: standard modules (DSP, HW accelerators, cache banks, etc.); large number of modules; NoC interconnect; applications; power-aware; highly parallel]

Page 12

Accessing On-Chip Cache banks through a NoC

[Figure: distributed L2 built from cache banks numbered 0 to 63, with processors P0 to P7]

Page 13

Exploring Cache-In-the-Middle CMP

• Shared last level cache (LLC)

– Single copy, no inter-cache coherence

• Banked, DNUCA

– Interconnected using Network-on-Chip (NoC)

[Figure: CPU0 to CPU7 placed around cache banks Bank0 to Bank7; distributed L2 with banks 0 to 63 and processors P0 to P7]

[Beckmann et al., MICRO’06] [Beckmann and Wood, MICRO’04]

Page 14

Remoteness of Shared Data

Shared data migrates to the center of the distributed cache, far away from clients

Longer access times

[Figure: CPU0 to CPU7 placed around cache banks Bank0 to Bank7; distributed L2 with banks 0 to 63 and processors P0 to P7]

Page 15

Observations on Memory Accesses

For many multithreaded applications (Splash-2, SpecOMP, Apache, Specjbb, STM, ...):

1. Access to shared lines is substantial

2. Shared lines are shared by many cores

3. A small number of lines account for a large fraction of the total accesses

A small number of lines, shared by many processors, is accessed numerous times

⇒ Shared hot lines effect
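The three observations can be checked on any memory trace with a few lines of bookkeeping. The sketch below is an illustration only: the trace format (pairs of core id and cache-line address) and the function name are assumptions, and the tiny trace in the usage example is synthetic rather than data from the benchmarks listed on the slide.

```python
from collections import Counter, defaultdict


def hot_line_stats(trace, top_k=1):
    """Summarize sharing in a trace of (core_id, line_address) accesses.

    Returns (fraction of accesses that go to lines touched by more than one
    core, fraction of accesses that go to the top_k most accessed lines).
    """
    accesses = Counter()
    sharers = defaultdict(set)
    for core, line in trace:
        accesses[line] += 1
        sharers[line].add(core)
    total = sum(accesses.values())
    if total == 0:
        return 0.0, 0.0
    shared_lines = [line for line, cores in sharers.items() if len(cores) > 1]
    shared_fraction = sum(accesses[line] for line in shared_lines) / total
    top_fraction = sum(count for _, count in accesses.most_common(top_k)) / total
    return shared_fraction, top_fraction


if __name__ == "__main__":
    # Synthetic trace: line "A" is hot and shared, "B" and "C" are private.
    trace = [(0, "A"), (1, "A"), (2, "A"), (3, "A"),
             (0, "B"), (1, "C"), (0, "A"), (1, "A")]
    print(hot_line_stats(trace))
```

A high first value together with a high second value for a small `top_k` is the signature of the shared hot lines effect.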

Page 16

Shared Data Hinders Cache Performance

What can be done better?

• Bring shared data closer to all processors

• Preserve vicinity of private data

[Figure: distributed L2 with banks 0 to 63 and processors P0 to P7]

Page 17

This Has Been Addressed Before

[Figure: aerial view of Nahalal cooperative village]

[Figure: overview of Nahalal cache organization, with processors P0 to P7]

Page 18

Nahalal Layout

• Partitioning of cache lines by “shared” vs. “private”

• Keep shared lines in the center

– Small & fast structure, close to all processors

• Use outer banks for private data

– Preserves vicinity of private data

[Figure: Nahalal layout with CPU0 to CPU7]

* Guz et al., [SPAA 2008], [CA-Letters’07]

Page 19

Average Distance – Shared vs. Private

[Figure: bar chart of average distance (number of hops, 0 to 3) for equake, fma3d, barnes, water, apache, zeus, specjbb, RBTree and HashTable; series: private line in CIM, private line in Nahalal, shared line in CIM, shared line in Nahalal]

[Figure: bar chart of average relative distance (0 to 3) for equake, fma3d, barnes, water, apache, zeus and specjbb; same four series]

Nahalal shortens the distance to shared data

Distance to private data remains roughly the same

Page 20

Cache Performance

26.8% improvement in average cache hit time

41.1% in apache

[Figure: average cache hit time (clock cycles, 0 to 50), CIM vs. NAHALAL, for equake, fma3d, barnes, water, apache, zeus, specjbb, RBTree and HashTable. Improvement labels in extraction order: 3.9%, 8.57%, 40.53%, 41.1%, 29.06%, 29.35%, 39.4%, 29.1%, 24.2%]

Page 21

Latency is an Issue in NoCs

Page 22

Latency Model

$$\text{Latency} = \text{Router Delay} + \text{Wire Delay}$$

$$\text{Router Delay} = n_{cyc} \cdot t_{clk}$$

where $n_{cyc}$ and $t_{clk}$ follow the Peh and Dally pipelined-router delay model (functions of the number of ports $p$ and virtual channels $v$).

$$\text{Unrepeated Wire Delay} = \frac{R_{int} C_{int}}{2}\, l^2$$

$$\text{Repeated Wire Delay} = 0.7\,\frac{R_0}{S_R}\left(C_{int}\, l + S_R C_0\right) + R_{int}\, l \left(0.4\, C_{int}\, l + 0.7\, S_R C_0\right)$$

* L.-S. Peh and W.J. Dally, "A delay model and speculative architecture for pipelined routers", HPCA 2001

* H.B. Bakoglu, Circuits, Interconnections and Packaging for VLSI, 1990.

• Technology-independent model: latency is measured in units of τ, the inverter’s switching delay
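To see why repeaters matter for long links, the two wire-delay expressions can be evaluated side by side. This is a minimal sketch under stated assumptions: it uses a Bakoglu-style repeater formula with the 0.7 and 0.4 coefficients, a factor of 0.5 for the unrepeated distributed RC line, and parameter names and values (`r0`, `c0`, `size`, unit resistances and capacitances) that are hypothetical and not taken from the talk.

```python
def unrepeated_wire_delay(r_int, c_int, length):
    """Delay of a distributed RC wire: grows quadratically with length.

    r_int, c_int are resistance and capacitance per unit length.
    """
    return 0.5 * r_int * c_int * length ** 2


def repeated_wire_delay(r_int, c_int, length, n_segments, r0, c0, size):
    """Delay of a wire split into n_segments, each driven by a repeater.

    r0, c0: output resistance and input capacitance of a minimum inverter.
    size:   repeater size relative to the minimum inverter (S_R).
    """
    seg = length / n_segments
    stage = (0.7 * (r0 / size) * (c_int * seg + size * c0)
             + r_int * seg * (0.4 * c_int * seg + 0.7 * size * c0))
    return n_segments * stage


if __name__ == "__main__":
    # Arbitrary normalized units, chosen only to show the trend.
    for length in (10, 50, 100):
        plain = unrepeated_wire_delay(1.0, 1.0, length)
        repeated = repeated_wire_delay(1.0, 1.0, length, length // 10, 1.0, 1.0, 1.0)
        print(length, plain, round(repeated, 1))
```

With a fixed segment length the repeated delay grows linearly with the wire length, while the unrepeated delay grows quadratically.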

Page 23

Ultimate Link Length

• Increasing wire speed: by widening, spacing and repeater insertion.

• Wire length reaches an ultimate limit, regardless of cost function.

• Maximal link length decreases as technology advances.

[Figure: (a) cost function per mm vs. length [mm] at 1 GHz and (b) at 2 GHz, one curve per technology node: 2009 (29 nm), 2011 (24 nm), 2013 (20 nm), 2015 (17 nm), 2017 (14 nm), 2019 (12 nm), 2021 (10 nm), 2023 (8 nm)]

[Figure: length [mm] (0 to 60) vs. technology node (29 nm to 8 nm), for 1 GHz and 2 GHz]

* R. Manevich, L. Polishuk, I. Cidon, A. Kolodny, "Design Tradeoffs of Long Links in Hierarchical Tiled Networks-on-Chip", DSD EUROMICRO 2013.

Page 24

Link Delay vs. Router Cycle

• For future technologies, link delay becomes worse

• When link delay is higher than the router’s clock cycle, link pipelining is needed

[Figure: link delay [τ] (log scale, 1 to 1000) vs. technology node [nm] (29 down to 7) for 16x16, 32x32 and 64x64 NoCs, compared with the router cycle [τ] for vc=2]

Page 25

Latency is a Basic Disadvantage in NoCs (for global packets which cross many routers)

In large systems, with traffic modeled by Rent’s rule, global packets (a minority):

• Consume most of the network’s BW.

• Significantly increase the average latency at light load.

* R. Manevich, I. Cidon, and A. Kolodny, "Handling Global Traffic in Future CMP NoCs", SLIP 2012

Page 26

Latency grows even worse when the NoC is loaded

Page 27

Typical Latency vs. Load

[Figure: latency as a function of load, with the light-load latency marked]

A loaded network quickly reaches a saturation point!
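The shape of such a latency curve can be reproduced with a very simple queueing-style approximation. The sketch below is not the model used in the talk: it assumes latency scales as 1 / (1 - load / saturation), and the zero-load latency of 20 cycles and saturation throughput of 0.4 are hypothetical values chosen only to illustrate the knee of the curve.

```python
def approx_latency(load, zero_load_latency=20.0, saturation=0.4):
    """Approximate packet latency at a given offered load.

    Assumption: an M/M/1-like blow-up as the load approaches saturation.
    load and saturation are in the same units (e.g. flits/cycle/node).
    """
    if not 0.0 <= load < saturation:
        raise ValueError("load must be in [0, saturation)")
    return zero_load_latency / (1.0 - load / saturation)


if __name__ == "__main__":
    for load in (0.0, 0.1, 0.2, 0.3, 0.35, 0.39):
        print(load, round(approx_latency(load), 1))
```

Latency stays close to the light-load value over much of the range and then rises steeply just before saturation.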

Page 28

Can NoC latency be reduced?

(“Ideas for CMP-aware NoC”)

Page 29

Reducing hop-count. Example: PyraMesh Topology

• Hierarchical 2D mesh.

• Global packets are routed through higher hierarchy levels.

• Overall hop count is reduced.

• Average latency is reduced.

• Average BW per router is reduced.

[Figure: route from source to destination taking 8 hops instead of 14!]

* R. Manevich, I. Cidon, and A. Kolodny, "Handling Global Traffic in Future CMP NoCs", SLIP 2012
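For reference, the hop count that a hierarchical topology tries to beat is the Manhattan distance of a flat 2D mesh. The sketch below covers only that flat-mesh baseline; it does not model PyraMesh routing, and the helper names are hypothetical.

```python
from itertools import product


def mesh_hops(src, dst):
    """Minimal hop count between two (x, y) routers in a flat 2D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def average_hops(n):
    """Average hop count over all ordered pairs of distinct routers, n x n mesh."""
    if n < 2:
        raise ValueError("need at least a 2 x 2 mesh")
    nodes = list(product(range(n), repeat=2))
    total = sum(mesh_hops(a, b) for a in nodes for b in nodes if a != b)
    return total / (len(nodes) * (len(nodes) - 1))


if __name__ == "__main__":
    print(mesh_hops((0, 0), (7, 7)))  # opposite corners of an 8 x 8 mesh
    for n in (4, 8, 16):
        print(n, round(average_hops(n), 2))
```

The average grows roughly linearly with the mesh dimension, which is why global packets dominate latency in large flat meshes.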

Page 30

Average Latency: Comparison of NoC topologies for a wide range of Rentian traffic loads

[Figure: latency [clk cycles] vs. Rent coefficient r (0.4 to 0.9) for NoC sizes (a) 64, (b) 256, (c) 1024 and (d) 4096; series: HNoC, Simple Mesh, PyraMesh, EVC, Boundary]

                                  Average   Maximal
Latency speedup vs Simple Mesh    1.55X     2.05X
Latency speedup vs 2nd best       1.21X     1.64X

* R. Manevich, L. Polishuk, I. Cidon, A. Kolodny, To be published, 2013.

Page 31

Some Sad Observations

• There is no “perfect” topology: you need to know your traffic model to choose a network topology

• Choosing the most suitable topology for your type of traffic can reduce the latency by no more than ~2X (at light loads).

• What can be done to prevent additional, congestion-related delays?

Page 32

Ideas for improving latency of Cache traffic in NoC

Page 33

E. Bolotin – The Power of Priority, NoCs 2007

Issues in NUCA-based CMP

[Figure: distributed L2 with banks 0 to 63 and processors P0 to P7]

• Each cache access ⇒ multiple NoC transactions

• NoC performance ⇒ CMP performance

• Cache coherency and transaction order (correctness)

• Search (in DNUCA)

• Different traffic types (e.g. fetch vs. prefetch)

• Synchronization (locks)

Need specialized NoC services for CMP!

Page 34

Observations on Cache Access

- Delay = Queueing + NoC transactions

- NoC transactions consist of:

• Short ctrl. packets

• Long data packets

Idea: Differentiate between Control and Data

Solution: Preemptive Priority NoC. Give priority to short control packets

[Figure: P0, P1 and P2 with their L1 caches and the L2 directory, connected through the NoC; P0 holds the line in modified state (P0-MOD.). Message labels: 3. READ EXCL. REQ; 4. INVALID. REQ; 5. INVALID. ACK; 6. READ EXCL. RESP (data transfer)]
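The benefit of letting short control packets preempt long data packets can be seen on a single link. The following sketch is a toy model, not QNoC itself: it assumes one flit is sent per cycle, preemption happens at flit granularity, and the packet tuples `(arrival_cycle, n_flits, is_control)` are a hypothetical format chosen for the example.

```python
def simulate_link(packets, preemptive_priority):
    """Return the latency (in cycles) of each packet on a single shared link.

    packets: list of (arrival_cycle, n_flits, is_control) with n_flits >= 1.
    Without priority, packets are served first-come first-served.
    With priority, a ready control packet always wins the next flit slot.
    """
    if any(n_flits < 1 for _, n_flits, _ in packets):
        raise ValueError("every packet needs at least one flit")
    remaining = [n_flits for _, n_flits, _ in packets]
    latency = [None] * len(packets)
    cycle = 0
    while any(lat is None for lat in latency):
        ready = [i for i, (arrival, _, _) in enumerate(packets)
                 if arrival <= cycle and remaining[i] > 0]
        if ready:
            if preemptive_priority:
                # Control packets first, then earliest arrival.
                chosen = min(ready, key=lambda i: (not packets[i][2], packets[i][0], i))
            else:
                chosen = min(ready, key=lambda i: (packets[i][0], i))
            remaining[chosen] -= 1
            if remaining[chosen] == 0:
                latency[chosen] = cycle + 1 - packets[chosen][0]
        cycle += 1
    return latency


if __name__ == "__main__":
    # An 8-flit data packet arrives first; a 1-flit control packet arrives a cycle later.
    traffic = [(0, 8, False), (1, 1, True)]
    print(simulate_link(traffic, preemptive_priority=False))  # [8, 8]
    print(simulate_link(traffic, preemptive_priority=True))   # [9, 1]
```

In this example the control packet's latency drops from 8 cycles to 1, while the data packet is delayed by a single cycle.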

Page 35

Preemptive Priority NoC: QNoC

Service Levels (SL):

• Dedicated wormhole buffer

• Preemptive priority scheduling

[Figure: multiple-SL link, a physical link carrying SL 0 to SL 3 from output to input]

[Figure: multiple-SL router, with input ports holding per-SL buffers (SL 0 to SL 3, size BufSize), a crossbar, a scheduler, credit-based control, and output ports]

Page 36

Priority NoC: Several Benchmarks

[Figure: L2 Access Delay Reduction by Priority-based NoC. Bar chart of delay reduction [%] (0 to 35) for Read and Read Exclusive transactions on apache, zeus, fft, ocean and radix. Bar labels in extraction order: 22.6, 31.8, 19.6, 28.4, 13.5, 25.3, 18.3, 32.9, 22.3, 28.0]

[Figure: Total Program Speedup by Priority-based NoC. Bar chart of speedup [%] (0 to 10) on apache, zeus, fft, ocean and radix. Bar labels in extraction order: 9.4, 8.7, 9.0, 8.6, 5.0]

* E. Bolotin, Z. Guz, I. Cidon, R. Ginosar and A. Kolodny, "The Power of Priority: NoC based Distributed Cache Coherency", NOCS 2007, Princeton, NJ, May 2007.

Page 37

Should we regard NoCs as truly distributed systems?

• …… chips are so small…..

• Idea: Use centralized mechanisms in NoCs!

Page 38

centralized mechanism example 1:

Bus-Enhanced NoC (BENoC)

Page 39

Idea of Bus-Enhanced NoC

• Motivation

– NoCs have high bandwidth, but latency suffers

– Group communication is expensive

• Approach

– Embed a bus to achieve synergy

– Optimize: bus for latency, NoC for bandwidth

– Use bus for meta-data only

[Figure: grid of NoC routers (R)]

Page 40

Bus-Enhanced NoC (BENoC)

• Bus re-introduced as a NoC “add-on”

• Use NoC for data

– Optimized for high bandwidth

• Use bus for short meta-data

– Low bandwidth, low latency

– Broadcast, multicast

[Figure: grid of modules, each attached to a NoC router (R)]

* R. Manevich, I. Walter, I. Cidon and A. Kolodny, "Best of Both Worlds: A Bus-Enhanced NoC (BENoC)", NOCS 2009, San Diego, CA, May 2009

Page 41

BENoC Services

• Fast unicast and multicast signaling

– CMP cache example

• Anycast

– Find resources that fulfill certain conditions

– E.g., “Looking for an idling DSP”; or “Where are the 5 closest multipliers?”

• Convergecast

– Efficient collection of feedback back to the initiator

• Barrier synchronization, …

Page 42

Bus-enhanced NoC for DNUCA

• Split large cache into independent smaller banks

– Non-uniform cache access time (NUCA)

• Cache lines are moved to shorten access time

– Dynamic NUCA

• Before fetching a line into its L1$, a CPU needs to find the L2 cache bank storing the line

[Figure: CPUs with private L1$ placed around a grid of L2$ banks]

Page 43

Simulation of DNUCA with Bus-enhanced NoC

Performance improvement in BENoC compared to a NoC-based CMP

(a) average read transaction latency; (b) application speed

Page 44

centralized mechanism example 2:

Centralized Adaptive Routing

Page 45

Route Selection

ATDOR - Adaptive Toggle Dimension Ordered Routing

Keep it simple! Centralized selection:

The option with less congested bottleneck link is preferred.
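The selection rule can be sketched for a single source-destination pair on a 2D mesh: compare the most loaded link on the XY route with the most loaded link on the YX route and pick the better one. This is an illustration of the stated rule only; the data structures (`link_load` as a dictionary keyed by directed links) and function names are assumptions, and the way ATDOR actually collects loads and distributes decisions is not modeled.

```python
def dimension_ordered_path(src, dst, first_axis):
    """Unit-hop links of a dimension-ordered route on a 2D mesh.

    first_axis=0 gives XY routing, first_axis=1 gives YX routing.
    Each link is a pair ((x, y), (x', y')) of neighboring routers.
    """
    pos = list(src)
    links = []
    for axis in (first_axis, 1 - first_axis):
        step = 1 if dst[axis] > pos[axis] else -1
        while pos[axis] != dst[axis]:
            nxt = list(pos)
            nxt[axis] += step
            links.append((tuple(pos), tuple(nxt)))
            pos = nxt
    return links


def bottleneck_load(path, link_load):
    """Load of the most congested link on the path (unknown links count as 0.0)."""
    return max((link_load.get(link, 0.0) for link in path), default=0.0)


def select_route(src, dst, link_load):
    """Prefer the route whose bottleneck link is less congested (ties go to XY)."""
    xy = dimension_ordered_path(src, dst, first_axis=0)
    yx = dimension_ordered_path(src, dst, first_axis=1)
    if bottleneck_load(yx, link_load) < bottleneck_load(xy, link_load):
        return "YX"
    return "XY"


if __name__ == "__main__":
    load = {((1, 0), (2, 0)): 0.9, ((0, 1), (1, 1)): 0.3}
    print(select_route((0, 0), (2, 1), load))  # YX
```

Because each pair has only two candidate routes, a central unit needs just one bit per source-destination pair to express its decision.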

Page 46

Centralized Adaptive Routing

Congestion aggregation

Routing control

Congestion data collection within the routers

* R. Manevich, I. Cidon, A. Kolodny, and I. Walter, "Centralized Adaptive Routing for NoCs," Computer Architecture Letters, vol. 9, no. 2, pp. 57-60, Feb. 2010.

Page 47

centralized mechanism example 3:

GANA: Global Arbitration NoC Architecture

Page 48

Global Arbiter NoC Architecture

• An overlay of a Data and Control

[Figure labels: GAU; NoC-like wires and simple routers; global arbiter; request and grant lines]

Page 49

GANA: Global Arbitration NoC Architecture

• A New NoC Architecture

• Power is 76% @ 0.25 load and 62% @ 0.75 load

• Area is 16% of a baseline NoC

• Single cycle latency per hop

• No Head-of-Line blocking

• No parking-lot effect: fairness imposed

* E. Zahavi, I. Cidon and A. Kolodny, "GANA: A Novel Low Cost Conflict Free NoC Architecture," ACM Transactions on Embedded Computing, 2013.

Page 50

Need for Heterogeneous NoCs

Page 51

CMP Bandwidth Requirements

• Different links in NoC-based CMP need different throughput capacities!

– Typically, links at the center carry more traffic.

Page 52

NoC-Based CMP Example: Non-uniform traffic

• 3 different types of links:

1. DRAM to L2$

• 22 GBps and 2 VCs

• Handles a read miss

2. L2$ to DRAM

• 12 GBps and 2 VCs

• Block replacements during miss handling in the L2$

3. Cores <-> L2$

• 3 GBps and 1 VC

Legend: C – Core + L1 cache; $ – L2 cache; D – DRAM controller.

* The link thickness corresponds to its capacity

Page 53

Heterogeneous NoC Router Architecture for CMPs

* I. Ben-Itzhak, I. Cidon and A. Kolodny, to be published, 2013.

Page 54

What can be done in processors and software?

(“Ideas for NoC-aware computing”)

Page 55

A Unified Machine Model

• Use both cache and many threads to shield memory access

– Derive simple equations for performance, power, BW, ...

[Figure: many cores (C) holding thread architectural states, sharing a cache that connects to external memory]

* Z. Guz, E. Bolotin, I. Keidar, A. Kolodny, A. Mendelson and U. Weiser, "Many-Core vs. Many-Thread Machines: Stay Away From the Valley", IEEE Computer Architecture Letters, Volume 8, Issue 1, Jan. 2009

Page 56

A Useful Plot for Multi-Threaded Systems

[Figure: performance vs. number of threads]

Page 57

Cache Machines

• Many cores (each may have its private L1) behind a shared cache

[Figure: many cores (C) behind a shared cache that connects to memory]

[Figure: performance vs. # threads, with the cache non-effective point marked (more threads ► lower hit-rate)]

Page 58

Multi-Thread Machines

• Memory latency shielded by multiple thread execution

[Figure: many cores (C) holding thread architectural states, connected directly to memory; timeline alternating execution and memory access]

[Figure: performance vs. # threads, rising to a maximum performance set by bandwidth limitations]

Page 59

Unified Machine Performance

• 3 regions: Cache efficiency region, The Valley, MT efficiency region

[Figure: performance vs. # threads, showing the cache region, the Valley and the MT region]

* Z. Guz, E. Bolotin, I. Keidar, A. Kolodny, A. Mendelson and U. Weiser, "Many-Core vs. Many-Thread Machines: Stay Away From the Valley", IEEE Computer Architecture Letters, Volume 8, Issue 1, Jan. 2009
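The three regions can be reproduced with a toy version of such a model. The sketch below is only loosely patterned on the cited unified model and its equations may differ from the paper's: the hit-rate function (perfect hits until a hypothetical `cache_threads` working sets fit, then inversely proportional to the thread count), the parameter values, and the omission of the off-chip bandwidth limit are all assumptions made for illustration.

```python
def performance(n_threads, n_pe=64, cpi_exe=1.0, mem_fraction=0.2,
                t_cache=1.0, t_mem=200.0, cache_threads=32):
    """Instructions per cycle for n_threads on n_pe processing elements.

    Assumptions (hypothetical): the cache holds the working sets of
    cache_threads threads; beyond that the hit rate falls as 1/n_threads.
    Off-chip bandwidth limits are ignored.
    """
    hit_rate = min(1.0, cache_threads / n_threads)
    t_avg = hit_rate * t_cache + (1.0 - hit_rate) * t_mem
    # Utilization: threads per PE divided by the stall factor, capped at 1.
    utilization = min(1.0, (n_threads / n_pe) / (1.0 + t_avg * mem_fraction / cpi_exe))
    return n_pe * utilization / cpi_exe


if __name__ == "__main__":
    for n in (8, 16, 32, 64, 256, 1024, 4096):
        print(n, round(performance(n), 2))
```

With these numbers performance climbs while the cache is effective, collapses once the working sets no longer fit (the valley), and recovers only when enough threads are available to hide the memory latency.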

Page 60

Hit Rate Dependency – 3 Classes

Three application families, based on how the cache miss rate depends on the number of threads N:

• A “strong” function of the number of threads: f(N^q) when q > 1

• A “weak” function of the number of threads: f(N^q) when q ≤ 1

• Miss rate is not affected by the number of threads

[Figure: performance vs. # threads for the three classes]

Page 61

Investigating Workload Parallelism

Example: Canneal, simulation results from PARSEC workloads

Not enough parallelism available!

[Figure: Canneal. Performance (GOPS, 0 to 26) and cache hit rate (%, 0 to 100) vs. number of threads (0 to 2000); series: simulation, analytical model, cache hit rate]

* Z. Guz, O. Itzhak, I. Keidar, A. Kolodny, A. Mendelson and U.C. Weiser, "Threads vs. Caches: Modeling the Behavior of Parallel Workloads", ICCD 2010

Page 62

Inherent Program Scalability Study

• Capture the parallelism limitation of the algorithm

• Use architecture model with no parallelism limiters

– No shared resources (e.g. cache, bandwidth)

– Perfect memory system: 1 cycle latency

• Focusing on inter-thread synchronization

– Using a special simulator

* O. Itzhak, I. Keidar, A. Kolodny and U. Weiser, To be published, 2013.

Page 63

Perfect parallelism scalability: blackscholes

Page 64

Good parallelism scalability: fluidanimate

Page 65

Poor parallelism scalability: raytrace

Page 66

What can be done when NoC latencies become dominant?

• More parallelism? Efficient thread switching?

• More locality?

• Special attention to shared data?

• Special attention to meta-data?

Page 67

Memory Intensive Machines

• Reducing BW (i.e. power) can be achieved by climbing up a constant-throughput curve

• Increase on-die memory (e.g. innovative cache, new ideas….?)

[Figure: constant-throughput curves TP1 to TP4; axis label TP/BW]

Page 68

Memristor Opportunities

• 3D memory, above CMOS logic

• Nonvolatile

• High density

• “For free”

[Figure: sea of nonvolatile memory above the logic]

Page 69

Deep Pipeline with Memristor-based Thread Reservoir

• Use memristors to reduce thread switch penalty

• At switch time:

– Instead of flushing, store the thread state in memristors

– Load pipeline stages for a different thread from memristors

* S. Kvatinsky et al., Computer Architecture Letters, 2013

Page 70

Summary

• Distances and associated latencies lead to interesting tradeoffs in NoC-based system architecture!

[Figure: large grid of NoC routers (R)]