
On-chip Network for Manycore Architecture

Myong Hyon “Brandon” Cho

Multicore to Manycore?

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Intel Xeon E7-x8xx: 10 cores, 32nm, 2011, Westmere-EX architecture; 2.4GHz, 30MB L3, 130W (E7-8870)

AMD FX 8-core: 8 cores, 32nm, 2012, Vishera (Bulldozer/Piledriver) architecture; 4.0GHz, 8MB L3, 125W (FX-8350)

Tilera TILE-Gx72: 72 cores, 40nm, 2013, TILE-Gx architecture; 1.0GHz, 18MB L3, ~60W

Images: © Intel Corporation, © Advanced Micro Devices, Inc., © Tilera Corporation

Multicore as the only way out

[Figure: log-scale plot (1.00E-01 to 1.00E+06) over 1965 to 2010, built up in steps, of Transistors (in thousands), Frequency (MHz), Performance, and Number of cores]

Data credited to Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanovic

vs. Other possibilities

© Wikipedia / Jurii

SiGe?

© Wikipedia / AlexanderAIUS

Graphene?

© iStockphoto / Andrey Volodin

Organic?

© The Economist

Quantum?

Intel Server Microarchitecture Roadmap (according to computerbase.de, 2011)

Years: 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018

Process: 45nm 32nm 22nm 14nm 10nm

Nehalem, Tylersburg, Westmere

Sandy Bridge, Romley, Ivy Bridge

Haswell, Haswell, Rockwell

Skylake, Skylake, Skymont

NoC as the key to manycore success

The on-chip network

• realizes every communication between cores.

• consumes energy proportionally to traffic size.

• provides key mechanisms for parallel programming.

Outline

NoC for Manycore

• Network-level Optimization: PROM (NoCArc’09), BAN (PACT’09)

• System-level Optimization: ENC (NOCS’11)

• Physical-level Design @ 45nm: EM2 Chip (’12/’13)

Network-level Optimization:

As simple as an oblivious network, as efficient as an adaptive network

PROM – path-based oblivious routing

Path-based, Randomized, Oblivious, Minimal Routing
Myong Hyon Cho, Mieszko Lis, Keun Sup Shim, Michel Kinsy, and Srinivas Devadas
NoCArc’09

PROM overcomes the limitation of oblivious routing through enhanced path diversity.

Oblivious routing vs Adaptive routing

Oblivious routing

• Local and simple

• Possibly poor resource utilization

Adaptive routing

• Possibly better resource utilization

• Global information required

For on-chip networks, oblivious routing is attractive because the performance/area overhead of adaptive routing is more significant in on-chip networks than in large-scale networks.

Poor utilization of oblivious routing

[Figure: mesh with source/destination pairs SA/DA and SB/DB, routed with DOR (XY)]

Path diversity improves oblivious routing

• Diversity helps improve utilization and reduce congestion.

[Figures: the SA/DA and SB/DB flows routed with O1TURN, with Valiant, and with ROMM (2-phase); the Valiant and ROMM routes pass through intermediate nodes IA and IB]
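The difference between DOR and O1TURN can be sketched in a few lines. This is a minimal illustration, not the routers' actual implementation: nodes are assumed to be `(x, y)` tuples on a 2D mesh, and the function names `dor_xy`, `dor_yx`, and `o1turn` are hypothetical. O1TURN is modeled as picking the XY or the YX dimension-order path with equal probability.

```python
import random

def dor_xy(src, dst):
    """Dimension-order route: finish all X hops first, then all Y hops."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

def dor_yx(src, dst):
    """Same as dor_xy with the roles of X and Y exchanged."""
    transposed = dor_xy((src[1], src[0]), (dst[1], dst[0]))
    return [(b, a) for (a, b) in transposed]

def o1turn(src, dst, rng=random):
    """O1TURN sketch: choose the XY or the YX path with probability 1/2 each."""
    return dor_xy(src, dst) if rng.random() < 0.5 else dor_yx(src, dst)

if __name__ == "__main__":
    print(dor_xy((0, 0), (2, 1)))
    print(dor_yx((0, 0), (2, 1)))
```

DOR always yields a single path per source/destination pair, while O1TURN yields at most two, which is the "Minimum" path diversity listed for it in the comparison of schemes.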

Network-level deadlock

• A dependency cycle on network resources causes network-level deadlocks.

[Figure, built up in steps: packets occupying queues Q1, Q2, Q3, and Q4 around a loop of routers, each blocked (x) by the next, shown next to the corresponding Channel Dependency Graph (CDG) over Q1, Q2, Q3, and Q4]
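Checking a channel dependency graph for deadlock potential amounts to checking it for a cycle. The sketch below assumes the CDG is given as a dictionary mapping each queue to the queues it waits on; the edge direction Q1 to Q2 to Q3 to Q4 to Q1 is an assumption made for illustration, and `has_cycle` is a hypothetical helper name.

```python
def has_cycle(cdg):
    """Return True if the directed graph `cdg` (node -> iterable of successors)
    contains a cycle, using a depth-first search with three node colors."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GRAY                      # node is on the current DFS path
        for nxt in cdg.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:                   # back edge: a dependency cycle
                return True
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK                     # fully explored, no cycle through it
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(cdg))

# Assumed example: four queues, each waiting on the next one around the loop.
cyclic_cdg = {"Q1": ["Q2"], "Q2": ["Q3"], "Q3": ["Q4"], "Q4": ["Q1"]}
# Removing one dependency breaks the cycle.
acyclic_cdg = {"Q1": ["Q2"], "Q2": ["Q3"], "Q3": ["Q4"], "Q4": []}

if __name__ == "__main__":
    print(has_cycle(cyclic_cdg), has_cycle(acyclic_cdg))
```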

Deadlock prevention

DOR never creates dependency cycles.

The XY and YX paths of O1TURN cause cycles; O1TURN requires 2 networks to separate them.

Each phase of Valiant causes cycles; Valiant requires 2 networks to separate them.

Each phase of ROMM causes cycles; n-phase ROMM was proposed with n networks to separate them, which we found to be wrong: n-phase ROMM only requires 2 networks.

Various oblivious routing schemes

                                     DOR      O1TURN   2-phase ROMM      n-phase ROMM            Valiant
Path diversity                       None     Minimum  Limited           Fair~Large              Large
# networks for deadlock prevention   1        2        2                 n*                      2
# hops                               minimal  minimal  minimal           minimal                 non-minimal
Comm. overhead                       None     None     log2(N) bits/pkt  (n-1) log2(N) bits/pkt  log2(N) bits/pkt

* erroneously proposed

PROM Routing

Path-based, Randomized, Oblivious, Minimal

Goal: Best minimal-path diversity

- Use ALL possible minimal routes.
- Each minimal route has the SAME CHANCE of being taken.

At each hop where there are multiple choices, compare the number of possible minimal paths after each choice, and take each choice with a probability proportional to that number.

[Figure, built up in steps: one minimal path from SA to DA; the successive hops along it are taken with probability 75% (alternative: 25%), 67% (alternative: 33%), 50% (alternative: 50%), and 100%]

The chance of this path being taken is:

75% × 67% × 50% × 100% = 25%

Probability Calculation

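• The probability function is reduced to a simple ratio.

[Figure: minimal-path box from SA to DA, with x hops remaining along X and y hops remaining along Y]

Number of minimal paths that remain after taking a hop in X or in Y:

$$N_X = \frac{(x+y-1)!}{(x-1)!\,y!}, \qquad N_Y = \frac{(x+y-1)!}{x!\,(y-1)!}$$

When $x>0$ and $y>0$:

$$P_Y = \frac{N_Y}{N_X+N_Y} = \frac{\dfrac{1}{x!\,(y-1)!}}{\dfrac{1}{x!\,(y-1)!}+\dfrac{1}{(x-1)!\,y!}} = \frac{y}{x+y}, \qquad P_X = \frac{N_X}{N_X+N_Y} = \frac{x}{x+y}$$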
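The ratio above can be checked numerically. The following sketch assumes `x` and `y` are the remaining hop counts along each axis and uses exact fractions; `step_probs` and `path_probability` are hypothetical helper names. A box with x = 3 and y = 1 is an assumed example that happens to reproduce the 75%, 67%, 50%, 100% sequence of the walkthrough.

```python
from fractions import Fraction
from itertools import permutations
from math import comb

def step_probs(x, y):
    """Return (P_X, P_Y) for a packet with x hops left in X and y hops left in Y."""
    if x + y == 0:
        raise ValueError("packet is already at its destination")
    return Fraction(x, x + y), Fraction(y, x + y)

def path_probability(moves, x, y):
    """Probability that the hop sequence `moves` (a string of 'X' and 'Y') is taken."""
    prob = Fraction(1)
    for move in moves:
        p_x, p_y = step_probs(x, y)
        if move == "X":
            prob *= p_x
            x -= 1
        else:
            prob *= p_y
            y -= 1
    return prob

if __name__ == "__main__":
    # Assumed example box: 3 hops in X, 1 hop in Y, so comb(4, 1) = 4 minimal paths.
    for path in sorted(set(permutations("XXXY"))):
        print("".join(path), path_probability(path, 3, 1))
```

Every minimal path ends up with probability 1 / C(x+y, x), which is the "same chance" goal stated for PROM.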

Large-box Problem

• Paths are equally taken, but links are not.

[Figure: link utilization on the minimal-path box (MPB) between src (SA) and dst (DA)]

When the MPB is large:
- edges are underutilized.
- inner links are congested, possibly with other flows inside.

Uniform PROM

Immediate Upstream Router   PX         PY
Don’t care                  x/(x+y)    y/(x+y)

Parameterized PROM

Immediate Upstream Router   PX               PY
On the X axis               (x+f)/(x+y+f)    y/(x+y+f)
On the Y axis               x/(x+y+f)        (y+f)/(x+y+f)

Parameterized PROM

[Figure: link utilization on the minimal-path box under parameterized PROM, for f=0, f=10, and f=25]
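The parameterized probabilities can be written as a small function. This is a sketch under stated assumptions: `x` and `y` are the remaining hops (both positive), `f` is the non-negative bias parameter, and `arrived_on` names the axis on which the immediate upstream router lies; the function name and argument encoding are hypothetical.

```python
from fractions import Fraction

def parameterized_probs(x, y, f, arrived_on):
    """Return (P_X, P_Y) for parameterized PROM.

    x, y       -- remaining hops along X and Y (both assumed > 0)
    f          -- bias parameter; f = 0 gives back uniform PROM
    arrived_on -- "X" if the immediate upstream router is on the X axis, else "Y"
    """
    total = x + y + f
    if arrived_on == "X":
        return Fraction(x + f, total), Fraction(y, total)
    if arrived_on == "Y":
        return Fraction(x, total), Fraction(y + f, total)
    raise ValueError("arrived_on must be 'X' or 'Y'")

if __name__ == "__main__":
    for f in (0, 10, 25):
        print(f, parameterized_probs(3, 3, f, "X"))
```

A larger f makes a packet more likely to keep going in the direction it arrived from, which pushes traffic toward the edges of the minimal-path box.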

Deadlock prevention

• Turn Models [Glass et al./J.ACM’94]:
  - Each turn model is a set of allowed turns.
  - No deadlock if all routes conform to the same turn model.

[Figure: West-First Turn Model and North-Last Turn Model]

Any minimal route on a 2D mesh network conforms to one of two turn models.*

• A route with no north-east or south-east turns conforms to the West-First turn model.

• A route with no north-west or south-west turns conforms to the North-Last turn model.

* Keun Sup Shim, Myong Hyon Cho, Michel Kinsy, Tina Wen, Mieszko Lis, Edward Suh, and Srinivas Devadas, Static Virtual Channel Allocation in Oblivious Routing, NOCS’09

Performance Evaluation

Various oblivious routing schemes

                                     DOR      O1TURN   2-phase ROMM      n-phase ROMM            Valiant           PROM
Path diversity                       None     Minimum  Limited           Fair~Large              Large             Fair~Large
# networks for deadlock prevention   1        2        2                 n*                      2                 2
# hops                               minimal  minimal  minimal           minimal                 non-minimal       minimal
Comm. overhead                       None     None     log2(N) bits/pkt  (n-1) log2(N) bits/pkt  log2(N) bits/pkt  None
Heavy-load performance               Fair     Good     Bad               Worst                   Worst             Best

BAN – bandwidth adaptive network

BAN achieves adaptivity with oblivious routing, using locally arbitrated bi-directional network links.

Oblivious Routing in On-Chip Bandwidth-Adaptive Networks
Myong Hyon Cho, Mieszko Lis, Keun Sup Shim, Michel Kinsy, Tina Wen, and Srinivas Devadas
PACT’09

Oblivious routing failure

[Figure: the flows SA to DA and SB to DB share a congested link]

Where can we do better?

Adaptive Network, not routing

• A set of bidirectional links connects network nodes.
  - The bandwidth of the link in one direction can be increased at the expense of the other direction.

[Figure: the same flows with increased bandwidth on the shared link: (a) when the yellow flow is dominant, (b) when the gray flow is dominant]

Routes do not change, and arbitration is all local.

BAN Hardware

Most hardware overhead is in the crossbar.

[Figure: BAN router datapath. A bandwidth allocator between two adjacent routers receives a pressure signal from each side and sets the link direction; on each side, a 1-to-v DEMUX and a v-to-1 MUX over virtual channels (1, …, v) surround the Xbar switch, with nop inputs and connections from and to other nodes]

Crossbar – 2 links, Unidirectional

• 4-input, 4-output, 4 Virtual Channels

Crossbar – 2 links, Bidirectional

• 4-input, 4-output, 4 Virtual Channels

Crossbar Size – 2 links

Links           Switch                         # xBar Inputs   # xBar Outputs   Relative xBar Size
Unidirectional  VC-to-Port (fully connected)   16              4                64
Bidirectional   VC-to-Port (fully connected)   16              8                128

Crossbar Size – 4 links

• 4-input, 4-output, 4 Virtual Channels

Links           Switch                         # xBar Inputs   # xBar Outputs   Relative xBar Size
Unidirectional  VC-to-Port (fully connected)   16              8                128
Bidirectional   VC-to-Port (fully connected)   16              16               256
Hybrid          VC-to-Port (fully connected)   16              12               192

The hybrid configuration has a crossbar 1.5 times larger than the unidirectional one (192 vs. 128), which typically increases the node size by around 15%.
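The relative crossbar sizes in these tables are consistent with a simple inputs × outputs product. The sketch below just redoes that arithmetic for the 4-link configurations; the dictionary layout and the function name `relative_xbar_size` are illustrative assumptions, and the product is a relative size measure, not an area model.

```python
def relative_xbar_size(inputs, outputs):
    """Relative crossbar size, taken as the number of input/output crosspoints."""
    return inputs * outputs

# (inputs, outputs) for the 4-link, VC-to-Port (fully connected) configurations.
FOUR_LINK_CONFIGS = {
    "Unidirectional": (16, 8),
    "Bidirectional": (16, 16),
    "Hybrid": (16, 12),
}

sizes = {name: relative_xbar_size(i, o) for name, (i, o) in FOUR_LINK_CONFIGS.items()}
hybrid_ratio = sizes["Hybrid"] / sizes["Unidirectional"]

if __name__ == "__main__":
    print(sizes, hybrid_ratio)
```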

Bandwidth Allocation

• Local arbiters between any two adjacent routers

[Figure: a bandwidth arbiter between two routers, one with 3 flits and the other with 1 flit]

The arbitration follows the demand from each router, always leaving at least one link in a direction if there is any flit that can move in that direction.
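One way to picture the arbiter is as a function from the two routers' demands to a split of the links. The sketch below is an assumption-heavy illustration: the proportional split and the even split when both sides are idle are choices made here, not the policy of the BAN hardware; only the "keep at least one link for a direction that has a flit" rule comes from the text. The name `allocate_links` is hypothetical.

```python
def allocate_links(total_links, demand_ab, demand_ba):
    """Split `total_links` between direction A->B and direction B->A.

    demand_ab, demand_ba -- number of flits waiting in each direction.
    Returns (links_ab, links_ba).
    """
    if total_links < 2:
        raise ValueError("need at least two links to serve both directions")
    if demand_ab == 0 and demand_ba == 0:
        half = total_links // 2                 # idle: assumed even split
        return half, total_links - half
    if demand_ba == 0:
        return total_links, 0                   # nothing to send B->A
    if demand_ab == 0:
        return 0, total_links                   # nothing to send A->B
    # Assumed policy: proportional to demand, but never starve a direction
    # that has at least one flit waiting.
    share = round(total_links * demand_ab / (demand_ab + demand_ba))
    links_ab = min(max(share, 1), total_links - 1)
    return links_ab, total_links - links_ab

if __name__ == "__main__":
    print(allocate_links(4, 3, 1))
```

Because the decision uses only the two adjacent routers' demands, it stays local and the routes themselves never change.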

Symmetry vs. Anti-symmetry

[Figure: Bit-complement and Transpose traffic patterns, both under dimension order routing]

Anti-symmetric Traffic

Symmetric Traffic

Symmetric Traffic with Burstiness

Traffic Pattern   Non-bursty   Bursty
Bit-complement    0%           20%
Uniform Random    8%           26%

How about real application traffic…?

• The traffic patterns in many real applications are not symmetric as data is processed by a sequence of modules.

System-level Optimization:

Autonomous & fine-grained thread migration protocol by NoC

ENC – exclusive native context

ENC provides the first deadlock-free protocol for autonomous thread migration for any microarchitecture.

Deadlock-Free Fine-Grained Thread Migration
Myong Hyon Cho, Mieszko Lis, Keun Sup Shim, Omer Khan, and Srinivas Devadas
NOCS’11 – Best Paper Award

Why thread migrations again?

• For a simple reason: it’s cheaper on a single die (so we can do it more often).

ThreadMotion [Rangan et al., ISCA09]

[Figure: cores at higher and lower voltage/frequency; labels: cache misses, cache hits]

Architectural Core Salvaging [Powell et al., ISCA09]

[Figure: one core has no defects, another has a defective floating-point unit; label: floating-point ops]

Execution Migration Machine (EM2) [Lis et al., SPAA11/CSAIL-TR]

[Figure: each core has the only copy of its data on-chip; label: data misses]

Migration protocols aren’t catching up...

• …use a centralized scheduler (e.g., an OS).
  - slow!

• …store contexts in extra buffers or in the memory hierarchy.
  - expensive and inefficient!

• …bring restrictions on how threads can migrate.
  - cannot exploit the full power of migration!

Need a fast migration protocol that...

• …provides functional correctness for arbitrary migrations.

• …supports autonomous migration scheduling.

• …has a simple & small implementation.

Protocol-level Deadlock

[Figure: six cores (A to F), each with its router, and in-flight threads labeled F, E, D, D, A, B, C, C]

If an autonomous migration protocol is careless…

• SWAP: A deadlock-prone autonomous migration protocol

• An eviction swaps the locations of two threads.

[Figure: threads exchanging locations]

Protocol-level Deadlock

• 100 random, synthetic migration patterns

• 64 threads on 64 cores, migrating every 100 cycles

• Network-level deadlock-free routing (DOR-XY)

[Figure: Deadlock (%) (0 to 100) vs. Number of Hotspots (1 to 4) for four configurations: 2 VCs / No Buffer, 4 VCs / No Buffer, 2 VCs / 4 contexts, 2 VCs / 8 contexts]

Exclusive Native Context (ENC) protocol

• Key idea: always accept arrived packets.
  - Evict a running thread in an unblockable way!

[Figure, built up in steps: a core with a running thread A; an incoming migration evicts thread A, which moves to its native core, where an exclusive space is reserved for it]

• Migrating threads must not block evicted threads.

• Separating virtual channel sets is a simple solution.

• Each thread has its own native core.
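The bookkeeping behind "always accept" can be sketched as follows. This is a behavioral toy model, not the ENC hardware protocol: it assumes thread i is native to core i, gives every core one guest context besides its exclusive native context, and ignores the network and virtual channels entirely. The class name `EncChip` and the move tuples are hypothetical.

```python
class EncChip:
    """Toy model: each core has an exclusive native context plus one guest context."""

    def __init__(self, n_cores):
        # Assumption for illustration: thread i is native to core i and starts there.
        self.location = {t: t for t in range(n_cores)}
        self.guest = {c: None for c in range(n_cores)}

    def migrate(self, thread, dest):
        """Move `thread` to core `dest`; return the list of moves that happened."""
        moves = []
        src = self.location[thread]
        if src == dest:
            return moves
        if self.guest[src] == thread:
            self.guest[src] = None              # free the guest context left behind
        if dest != thread:                      # arriving at a non-native core
            victim = self.guest[dest]
            if victim is not None:
                # Eviction: the victim's native context is reserved for it,
                # so this move can always be accepted.
                self.location[victim] = victim
                moves.append(("evict", victim, dest, victim))
            self.guest[dest] = thread
        # If dest is the thread's native core, its exclusive context is used.
        self.location[thread] = dest
        moves.append(("migrate", thread, src, dest))
        return moves

if __name__ == "__main__":
    chip = EncChip(4)
    print(chip.migrate(0, 2))
    print(chip.migrate(1, 2))
```

In this model an arriving thread is never refused: either it lands in its own native context, or it takes the guest context after the previous guest has been sent home.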

Application performance results

• Total migration distance: no overhead in real applications

[Figure: Normalized Total Hop Count (0 to 1.2) for SWAP, SWAPinf, and ENC on RANDOM, FFT, RADIX, LU, OCEAN, and WATER; three bars are marked DEADLOCK]

Application performance results

• Completion time: 11.7% overhead of ENC over SWAPinf (on avg.)

[Figure: Normalized Completion Time (0 to 2) for SWAP, SWAPinf, and ENC on RANDOM, FFT, RADIX, LU, OCEAN, and WATER; three bars are marked DEADLOCK]

Physical-level Design:

NoC router implementation for EM2 (IBM SOI 45nm)

EM2 Implementation - Overview

110-core Shared Memory Processor

ISA                          EM2 Stack ISA
Shared Memory Architecture   1. EM2  2. RA (Remote Access)  3. EM2+RA
Cache                        8KB I$ / 32KB D$ at each core; total 4.4MB on chip; single-cycle read hits, two-cycle write hits
Technology                   IBM SOI12SO 45nm
IP                           ARM sc12 library (high voltage threshold), IBM SRAM compiler, IBM IO library (wire-bonding), IBM PLL, etc.

NoC router specification for EM2

Channels                    Six independent 64-bit channels
Communication               Unicast, in-order
Architectural Performance   1 cycle/hop
Scheduling Algorithm        Maximal scheduling
Routing                     DOR
Network Buffer              Single 4-flit ingress buffer for each port

[Figure: the six channels, grouped as Remote Access, Migration (EM2), and DRAM Access, and labeled Migration, Eviction, Request, Response, Request, Response]

6 Independent Physical Networks

[Figure: 6-network router with maximal scheduling, 330um × 330um]

Metal Layers   Usage
m1, m2, m3     Local logic
c1, c2         Local routing
b1, b2, b3     Remote routing / power grid
ua, ub         Global power grid
lb             Chip IO

Six 64-bit networks need a width of 222um.

Tile Floorplanning

[Figure: tile layout with Router, Core, 32KB D$, Predictor, and 8KB I$]

[Figure: tile floorplan for the EM2 tile, 855um × 917um]

[Figure: placement results showing ROUTER, CORE, and PREDICTOR]

EM2 tile

Width                             855um
Height                            917um
RC extracted STA (@ typical):
  Working frequency               105MHz
  Hold time slack                 0.2ns
Power estimation (10% activity)   50mW

[Figure: tile memories labeled D$ (×4), I$ (×2), D$ tags, and I$ tags]

Chip Floorplanning

Connecting Router Links

Chip-level Clock Tree

Tile-level Clock Tree

[Figure: clock trees, with points labeled A and B]

EM2 chip

Width    10mm
Height   10mm
~357 million transistors
11-by-10 EM2 tile array
18 man-months

[Figure: chip layout labeled CLK, D-CAPs, and I/O; the EM2 tile array lies below the top 2 metal layers]

More Link Bandwidth?

Wires connecting to router pins

Application Migration Patterns

Thread Concentration on 64-core EM2 (EM2 only, no RA)*

          Barnes   LU-contiguous   Ocean-contiguous   Radix   Water-n-squared
Maximum   5        18              15                 64      5
Average   2.2      1.6             6.8                4.1     2.1

* simulated for a 64-core version of EM2

Applications can saturate the resource cap

In YX routing, threads going into the ‘hot core’ are more congested on the horizontal links.

In YX routing, threads evicted from the ‘hot core’ are more congested on the vertical links.

BAN on EM2 (Simulation study)

[Figure: Average Migration Latency, normalized (0 to 1.2), for UN vs. BAN on BARNES, LU, OCEAN, RADIX, and WATER; EM2 only (no RA)]

* simulated for a 64-core version of EM2

Outline

NoC for Manycore

• Network-level Optimization: PROM (NoCArc’09), BAN (PACT’09)

• System-level Optimization: ENC (NOCS’11)

• Physical-level Design @ 45nm: EM2 Chip (’12/’13)

Extra slides

Crossbar Size – 2 lanes

• 4-input, 4-output, 4 Virtual Channels

Links           Switch                           # xBar Inputs   # xBar Outputs   Relative xBar Size
Unidirectional  VC-to-Port (fully connected)     16              4                64
Bidirectional   VC-to-Port (fully connected)     16              8                128
Unidirectional  Port-to-Port (w/ input VC mux)   4               4                16
Bidirectional   Port-to-Port (w/ input VC mux)   8               8                64

Link Arbitration Frequency

• How frequently do directions need to change?

• Few links change their directions in 10~20 cycles.

Infrequent Link Arbitration

[Figure: curves labeled unidirectional, N=100, and N=1]

Protocol-level Deadlock

[Figure: dependency cycle between cores C and D over the channels Core C to Router C, Router C to Router D, Router D to Core D, Core D to Router D, Router D to Router C, and Router C to Core C; migrations D to C and C to D]

Packets are assumed to be consumed at the destination.

Cyclic Resource Dependency Graph

[Figure: resource dependency graph for node1 (core1) and node2 (core2) over a single migration network, with resources C1N1, N1C1, C2N2, N2C2, NetN1, and NetN2]

Acyclic Resource Dependency Graph

[Figure: the same graph with separate migration and eviction networks, adding resources N1Net, N2Net, and NetNative]

Exclusive Native Context Zero (ENC0)

• ENC0: A thread always visits its native core first!

[Figure: threads and their native cores]

Exclusive Native Context (ENC)

• ENC0: A thread always visits its native core first!

• ENC: A thread goes to its native core only if evicted by another thread.

• ENC saved 10 network hops (52.6%) in this example.

• Moving a thread context onto the network must be atomic (extra logic cost).

[Figure: threads A and B and their native cores]

Execution Migration Machine (EM2)

• In many parallel applications, each thread mostly works on its private data.

• In EM2, a migrating thread mostly returns to a specific core.

Memory accesses on home core

Round Robin Scheduling

[Figure: round-robin arbiter. An RR counter (+1) drives a MUX over the inputs “N”, “E”, “W”, “S”, “C”; the selected input wins the output port]

“Bubble” cycles occur when no flit is available on the selected input port (non-maximal).

Maximal Scheduling

[Figure: five MUXes take the inputs in rotated orders (“C N E W S”, “S C N E W”, “W S C N E”, “E W S C N”, “N E W S C”); the RR counter (+1) and fixed priority logic (left-to-right) determine which input wins the output port]

Maximal scheduling without bubbles. Area cost: 6.7% (tile)
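The difference between the two schedulers can be sketched per output port. This is a simplified software model under assumptions: `requests` is the set of input ports holding a flit for this output, the counter advances by one each cycle elsewhere, and the rotation order used here is illustrative rather than the exact wiring of the MUXes. The function names are hypothetical.

```python
PORTS = ["N", "E", "W", "S", "C"]

def round_robin_grant(requests, counter):
    """Plain round robin: only the input selected by the counter may win.

    Returns the winning port, or None (a bubble) if that input has no flit.
    """
    port = PORTS[counter % len(PORTS)]
    return port if port in requests else None

def maximal_grant(requests, counter):
    """Rotating priority: start at the counter's port and take the first
    requesting input in rotated order, so a waiting flit is never skipped."""
    n = len(PORTS)
    for offset in range(n):
        port = PORTS[(counter + offset) % n]
        if port in requests:
            return port
    return None  # no input has a flit for this output

if __name__ == "__main__":
    waiting = {"W"}
    print([round_robin_grant(waiting, c) for c in range(5)])
    print([maximal_grant(waiting, c) for c in range(5)])
```

With a single waiting input, the plain round-robin model grants the port in only one out of five cycles, while the rotating-priority model grants it every cycle.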

Application performance results

• Total migration distance: no overhead in real applications

[Figure: Normalized Hop Count (0 to 1.6) for SWAP, SWAPinf, ENC0, and ENC on RANDOM, FFT, RADIX, LU, OCEAN, and WATER; three bars are marked DEADLOCK]

Application performance results

• Completion time: 11.7% overhead of ENC over SWAPinf (on avg.)

[Figure: Normalized Completion Time (0 to 2) for SWAP, SWAPinf, ENC0, and ENC on RANDOM, FFT, RADIX, LU, OCEAN, and WATER; three bars are marked DEADLOCK]
