On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho


Multicore to Manycore?

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Intel Xeon E7-x8xx

10 cores

Westmere-EX architecture

2.4GHz, 30MB L3, 130W (E7-8870)

AMD FX 8-core

8 cores

Vishera (Bulldozer/Piledriver) architecture

4.0GHz, 8MB L3, 125W (FX-8350)

Tilera TILE-Gx72

72 cores

TILE-Gx architecture

1.0GHz, 18MB L3, ~60W

Multicore as the only way out

[Figure: log-scale trends from 1965 to 2010 (vertical axis from 1.00E-01 to 1.00E+06): transistors (in thousands), frequency (MHz), performance, and number of cores]

Data credited to Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanovic

vs. Other possibilities

Graphene?

Organic?

Quantum?

vs. Other possibilities

Nehalem, Tylersburg, Westmere

Sandy Bridge, Romley, Ivy Bridge

Haswell, Haswell, Rockwell

Skylake, Skylake, Skymont

[Timeline: 2009 to 2018; process nodes 45nm, 32nm, 22nm, 14nm, 10nm]

Intel Server Microarchitecture Roadmap, according to computerbase.de, 2011

NoC as the key to manycore success

The on-chip network:

• realizes every communication between cores.

• consumes energy in proportion to traffic size.

• provides key mechanisms for parallel programming.

Outline

NoC for Manycore

• Network-level Optimization: NoCArc’09, PACT’09

• System-level Optimization: NOCS’11

• Physical-level Design @ 45nm: EM2 Chip ’12/’13

Network-level Optimization:

As simple as an oblivious network, as efficient as an adaptive network

PROM – path-based oblivious routing

Path-based, Randomized, Oblivious, Minimal Routing
Myong Hyon Cho, Mieszko Lis, Keun Sup Shim, Michel Kinsy, and Srinivas Devadas

NoCArc’09

overcomes the limitation of oblivious routing by enhanced path diversity.

Oblivious routing vs Adaptive routing

Oblivious routing

• Local and simple

• Possibly poor resource utilization

Adaptive routing

• Possibly better resource utilization

• Global information required

For on-chip networks, oblivious routing is the more practical choice, because the performance/area overhead of adaptive routing is more significant in on-chip networks than in large-scale networks.

Poor utilization of oblivious routing

DOR (XY)

Path diversity improves oblivious routing

O1TURN, Valiant, ROMM (2-phase)

• Diversity helps improve utilization and reduce congestion.

Network-level deadlock

• A dependency cycle on network resources causes network-level deadlocks.

Channel Dependency Graph (CDG)

Deadlock prevention

DOR never creates dependency cycles.

XY and YX paths of O1TURN cause cycles.

O1TURN requires 2 networks to separate them.
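The two claims above can be checked mechanically on a small mesh. The following is a minimal sketch, not the authors' tooling: it builds a channel dependency graph for all-to-all traffic on a 3-by-3 mesh and looks for cycles. The function names (`xy_route`, `yx_route`, `build_cdg`, `has_cycle`) and the mesh size are illustrative assumptions.

```python
from itertools import product

def xy_route(src, dst):
    """Nodes visited by dimension-order routing, X first and then Y."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

def yx_route(src, dst):
    """Nodes visited when routing Y first and then X."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    return path

def build_cdg(routes):
    """Edge c1 -> c2 means some route holds channel c1 while requesting c2."""
    cdg = {}
    for path in routes:
        channels = list(zip(path, path[1:]))
        for c in channels:
            cdg.setdefault(c, set())
        for c1, c2 in zip(channels, channels[1:]):
            cdg[c1].add(c2)
    return cdg

def has_cycle(graph):
    """Iterative depth-first search with white/gray/black coloring."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}
    for start in graph:
        if color[start] != WHITE:
            continue
        color[start] = GRAY
        stack = [(start, iter(graph[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if color[nxt] == GRAY:
                    return True  # back edge: dependency cycle found
                if color[nxt] == WHITE:
                    color[nxt] = GRAY
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                color[node] = BLACK
                stack.pop()
    return False

k = 3  # assumed mesh size for the demonstration
nodes = list(product(range(k), repeat=2))
pairs = [(s, d) for s in nodes for d in nodes if s != d]
xy_routes = [xy_route(s, d) for s, d in pairs]
yx_routes = [yx_route(s, d) for s, d in pairs]

xy_alone_cyclic = has_cycle(build_cdg(xy_routes))             # DOR on one network
yx_alone_cyclic = has_cycle(build_cdg(yx_routes))             # YX on its own network
mixed_cyclic = has_cycle(build_cdg(xy_routes + yx_routes))    # both on one network
print(xy_alone_cyclic, yx_alone_cyclic, mixed_cyclic)
```

Under these assumptions, each route set is acyclic on its own, while sharing one network between XY and YX routes produces a cycle, which is why the two sets are kept on separate networks.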

Each phase of ROMM causes cycles.

n-phase ROMM uses n networks to separate them.

Each phase of Valiant causes cycles.

Valiant requires 2 networks to separate them.

…which we found to be wrong! n-phase ROMM only requires 2 networks.

Various oblivious routing schemes

                                     DOR      O1TURN   2-phase ROMM      n-phase ROMM            Valiant
Path diversity                       None     Minimum  Limited           Fair~Large              Large
# networks for deadlock prevention   1        2        2                 n*                      2
# hops                               minimal  minimal  minimal           minimal                 non-minimal
Comm. overhead                       None     None     log2(N) bits/pkt  (n-1) log2(N) bits/pkt  log2(N) bits/pkt

* erroneously proposed

PROM Routing

Path-based

Oblivious Minimal

Randomized

Goal: Best minimal-path diversity

- Use ALL possible minimal routes
- Each minimal route has the SAME CHANCE to be taken.

PROM Routing

Path-based

Oblivious Minimal

Randomized

At each hop, where there are multiple choices,

…compare the number of possible minimal paths after each choice

PROM Routing

[Figure: step-by-step example on a mesh; at successive hops along the highlighted path, the next hop is chosen with probability 75%, 67%, 50%, and 100%]

The chance of this path being taken is:

75% × 67% × 50% × 100% = 25%
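The per-hop percentages can be reproduced by counting minimal paths. This is a minimal sketch, assuming the example flow has 3 hops left in X and 1 hop left in Y (an assumption chosen because it matches the percentages above); the function names are illustrative and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from math import comb

def minimal_paths(x, y):
    """Number of minimal paths with x hops left in X and y hops left in Y."""
    return comb(x + y, x)

def step_probabilities(x, y):
    """(P_X, P_Y): weight each choice by the minimal paths remaining after it."""
    total = minimal_paths(x, y)
    p_x = Fraction(minimal_paths(x - 1, y), total) if x > 0 else Fraction(0)
    p_y = Fraction(minimal_paths(x, y - 1), total) if y > 0 else Fraction(0)
    return p_x, p_y

def path_probability(x, y, moves):
    """Probability of one specific path, given as a string of 'x'/'y' moves."""
    prob = Fraction(1)
    for move in moves:
        p_x, p_y = step_probabilities(x, y)
        if move == "x":
            prob *= p_x
            x -= 1
        else:
            prob *= p_y
            y -= 1
    return prob

# Assumed example: 3 hops in X, 1 hop in Y, taking X, X, Y, X.
example = path_probability(3, 1, "xxyx")   # 3/4 * 2/3 * 1/2 * 1
all_paths = ["yxxx", "xyxx", "xxyx", "xxxy"]
all_probs = [path_probability(3, 1, p) for p in all_paths]
print(example, all_probs)
```

Every one of the four minimal paths comes out at the same probability, which is the stated goal of PROM.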

Probability Calculation

• The probability function is reduced to a simple ratio.

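With $x$ and $y$ the remaining hops in each dimension, the number of minimal paths left after a hop in X or in Y is

$N_X = \frac{(x+y-1)!}{(x-1)!\,y!}, \qquad N_Y = \frac{(x+y-1)!}{x!\,(y-1)!}$

When $x>0$ and $y>0$:

$P_X = \frac{N_X}{N_X+N_Y} = \frac{x!\,(y-1)!}{x!\,(y-1)! + (x-1)!\,y!} = \frac{x}{x+y}$

$P_Y = \frac{N_Y}{N_X+N_Y} = \frac{y}{x+y}$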

Large-box Problem

• Paths are equally taken, but links are not.

[Figure: link utilization on the minimal-path box (MPB) between src and dst]

When the MPB is large:
- edges are underutilized.
- inner links are congested, possibly with other flows inside.

Uniform PROM

Immediate upstream router    $P_X$
Don’t care                   $\frac{x}{x+y}$

Parameterized PROM

Immediate upstream router    $P_X$                    $P_Y$
On the X axis                $\frac{x+f}{x+y+f}$      $\frac{y}{x+y+f}$
On the Y axis                $\frac{x}{x+y+f}$        $\frac{y+f}{x+y+f}$

[Figure: link utilization on the minimal-path box under parameterized PROM, for f=0, f=10, and f=25]
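The effect of the parameter f on link utilization can be estimated with a small dynamic program. This is a sketch under stated assumptions, not the authors' simulator: a packet that arrived along X continues in X with probability (x+f)/(x+y+f), a packet that arrived along Y turns to X with probability x/(x+y+f), and at the source (no upstream router) the uniform ratio x/(x+y) is assumed. The function name `link_loads` and the 3-by-3 box are illustrative.

```python
from fractions import Fraction

def link_loads(X, Y, f=0):
    """Expected load on each link for one unit of flow from (0, 0) to (X, Y).

    f must be a non-negative integer; f=0 gives uniform PROM.
    """
    loads = {}
    # Probability mass per (column, row, direction of arrival).
    mass = {(0, 0, None): Fraction(1)}
    for _ in range(X + Y):
        nxt = {}
        for (col, row, arrived), p in mass.items():
            x, y = X - col, Y - row          # hops still to go
            if x == 0:
                p_x = Fraction(0)
            elif y == 0:
                p_x = Fraction(1)
            elif arrived == "x":
                p_x = Fraction(x + f, x + y + f)
            elif arrived == "y":
                p_x = Fraction(x, x + y + f)
            else:                            # source node: assumed uniform ratio
                p_x = Fraction(x, x + y)
            choices = (("x", p_x, (col + 1, row)), ("y", 1 - p_x, (col, row + 1)))
            for direction, q, dest in choices:
                if q == 0:
                    continue
                link = ((col, row), dest)
                loads[link] = loads.get(link, 0) + p * q
                key = dest + (direction,)
                nxt[key] = nxt.get(key, 0) + p * q
        mass = nxt
    return loads

uniform = link_loads(3, 3, f=0)
biased = link_loads(3, 3, f=25)
edge_link = ((2, 0), (3, 0))    # last link along the bottom edge of the box
inner_link = ((1, 1), (2, 1))   # a link inside the box
print(uniform[edge_link], uniform[inner_link], biased[edge_link])
```

In this model, uniform PROM puts far less load on the edge link than on the inner link, and a larger f shifts load back toward the edges of the minimal-path box.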

Deadlock prevention

• Turn Models [Glass et al./J.ACM’94]:
  - Each turn model is a set of allowed turns.
  - No deadlock if all routes conform to the same turn model.

[Figure: West-First Turn Model and North-Last Turn Model]

Deadlock prevention

Any minimal route on a 2D mesh network conforms to one of two turn models.*

* Keun Sup Shim, Myong Hyon Cho, Michel Kinsy, Tina Wen, Mieszko Lis, Edward Suh, and Srinivas Devadas, Static Virtual Channel Allocation in Oblivious Routing, NOCS’09

• A route with no north-east or south-east turns conforms to the West-First turn model.

• A route with no north-west or south-west turns conforms to the North-Last turn model.

Performance Evaluation

Various oblivious routing schemes

                                     DOR      O1TURN   2-phase ROMM      n-phase ROMM            Valiant           PROM
Path diversity                       None     Minimum  Limited           Fair~Large              Large             Fair~Large
# networks for deadlock prevention   1        2        2                 n*                      2                 2
# hops                               minimal  minimal  minimal           minimal                 non-minimal       minimal
Comm. overhead                       None     None     log2(N) bits/pkt  (n-1) log2(N) bits/pkt  log2(N) bits/pkt
Heavy-load performance               Fair     Good     Bad               Worst                   Worst             Best

* as originally proposed; n-phase ROMM only requires 2 networks.

BAN – bandwidth adaptive network

achieves adaptivity with oblivious routing, using locally arbitrated bi-directional network links.

Oblivious Routing in On-Chip Bandwidth-Adaptive Networks
Myong Hyon Cho, Mieszko Lis, Keun Sup Shim, Michel Kinsy, Tina Wen, and Srinivas Devadas

PACT’09

Oblivious routing failure

congested

Where can we do better?

Adaptive Network, not routing

[Figure label: increased bandwidth]

• A set of bidirectional links connects network nodes.
  - The bandwidth of the link in one direction can be increased at the expense of the other direction.

Adaptive Network, not routing

(a) When yellow flow is dominant

(b) When gray flow is dominant

Routes do not change, and arbitration is all local.

BAN Hardware

Most hardware overhead in the crossbar

[Figure: a bandwidth allocator between two adjacent routers receives a pressure signal from each side and sets the link direction; each router's crossbar switch serves virtual channels (1, …, v) with links from and to other nodes]

Crossbar – 2 links, Unidirectional

• 4-input, 4-output, 4 Virtual Channels

Crossbar – 2 links, Bidirectional

Links            Switch                         # xBar Inputs   # xBar Outputs   Relative xBar Size
Unidirectional   VC-to-Port (fully connected)   16              4                64

Crossbar Size – 2 links

[Table: # xBar outputs and relative xBar size for the unidirectional, bidirectional, and hybrid configurations]

Crossbar Size – 4 links

The hybrid configuration has a 1.5 times larger crossbar, which typically increases the node size by around 15%.

Bandwidth Allocation

• Local arbiters between any two adjacent routers

[Figure: a bandwidth arbiter between two routers, one with 3 flits and the other with 1 flit to send]

The arbitration follows demands from each router, always leaving at least one link in a direction if there is any flit that can move in that direction.
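The arbitration rule can be illustrated with a small allocation function. This is a hedged sketch, not the BAN hardware allocator: it assumes demand is measured in waiting flits and that links are split in proportion to demand, which the slides do not specify. The function name `allocate_links` and the even split when both sides are idle are assumptions made for illustration.

```python
def allocate_links(total_links, demand_ab, demand_ba):
    """Split bidirectional links between directions A->B and B->A.

    Returns (links_ab, links_ba). A direction with at least one waiting
    flit always keeps at least one link.
    """
    if total_links < 2:
        raise ValueError("need at least two links to serve both directions")
    total_demand = demand_ab + demand_ba
    if total_demand == 0:
        half = total_links // 2            # assumed idle setting: even split
        return half, total_links - half
    # Proportional share, rounded half up using integer arithmetic.
    links_ab = (2 * total_links * demand_ab + total_demand) // (2 * total_demand)
    if demand_ab > 0:
        links_ab = max(links_ab, 1)                # keep one link for A->B
    if demand_ba > 0:
        links_ab = min(links_ab, total_links - 1)  # keep one link for B->A
    return links_ab, total_links - links_ab

print(allocate_links(4, 3, 1))    # 3 flits one way, 1 flit the other
print(allocate_links(4, 5, 0))    # nothing waiting in the reverse direction
print(allocate_links(4, 100, 1))  # heavy imbalance still leaves one link
```

Because the decision uses only the demand of the two routers sharing the links, it stays local, in line with the statement that routes do not change and arbitration is all local.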

Symmetry vs. Anti-symmetry

Bit-complement Transpose

*Both under dimension order routing
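The difference between the two patterns can be quantified by comparing the load in the two directions of every physical link. This is a minimal sketch under assumptions: an 8-by-8 mesh, XY dimension-order routing, one unit of traffic per source, and an illustrative `asymmetry` metric (sum of per-link direction imbalance divided by total load) that is not taken from the slides.

```python
from collections import Counter

def xy_links(src, dst):
    """Directed links used by XY dimension-order routing."""
    (x, y), (dx, dy) = src, dst
    links = []
    while x != dx:
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dy:
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links

def link_loads(k, pattern):
    """Load per directed link when every node sends one unit to pattern(node)."""
    loads = Counter()
    for x in range(k):
        for y in range(k):
            dst = pattern(x, y, k)
            if dst != (x, y):
                loads.update(xy_links((x, y), dst))
    return loads

def asymmetry(loads):
    """0 if both directions of every link carry equal load, 1 if fully one-way."""
    total = sum(loads.values())
    seen = set()
    imbalance = 0
    for (a, b) in list(loads):
        if (a, b) in seen or (b, a) in seen:
            continue
        seen.add((a, b))
        imbalance += abs(loads[(a, b)] - loads[(b, a)])
    return imbalance / total

def bit_complement(x, y, k):
    return (k - 1 - x, k - 1 - y)

def transpose(x, y, k):
    return (y, x)

k = 8  # assumed mesh size
bitcomp_asym = asymmetry(link_loads(k, bit_complement))
transpose_asym = asymmetry(link_loads(k, transpose))
print(bitcomp_asym, transpose_asym)
```

In this model bit-complement loads both directions of each link equally, whereas transpose uses each link in one direction only, which is the case where reallocating link bandwidth has the most room to help.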

Anti-symmetric Traffic

Symmetric Traffic

Symmetric Traffic with Burstiness

Traffic Pattern    Non-bursty   Bursty
Bit-complement     0%           20%
Uniform Random     8%           26%

How about real application traffic…?

• The traffic patterns in many real applications are not symmetric as data is processed by a sequence of modules.

System-level Optimization:

autonomous & fine-grained thread migration protocol by NoC

ENC – exclusive native context

provides the first deadlock-free protocol for autonomous thread migration for any microarchitecture.

Deadlock-Free Fine-Grained Thread Migration
Myong Hyon Cho, Mieszko Lis, Keun Sup Shim, Omer Khan, and Srinivas Devadas

NOCS’11 – Best Paper Award

Why thread migrations again?

• For a simple reason: it’s cheaper on a single die (so we can do it more often).

ThreadMotion [Rangan et al., ISCA09]

[Figure: threads move between a higher voltage/frequency core and a lower voltage/frequency core, depending on cache misses and cache hits]

Architectural Core Salvaging [Powell et al., ISCA09]

[Figure: floating-point ops migrate from a core that has a defective floating-point unit to a core that has no defects]

Execution Migration Machine (EM2) [Lis et al., SPAA11/CSAIL-TR]

[Figure: each core has the only on-chip copy of its data; threads migrate on data misses]

Migration protocols aren’t catching up...

• …use a centralized scheduler (e.g., an OS).
  - slow!

• …store contexts in an extra buffer or in the memory hierarchy.
  - expensive and inefficient!

• …bring restrictions on how threads can migrate.
  - cannot exploit the full power of migration!

Need a fast migration protocol that...

• …provides functional correctness for arbitrary migrations.

• …supports autonomous migration scheduling.

• …has a simple & small implementation.

Protocol-level Deadlock

[Figure: six cores (Core A to Core F), each attached to its own router (Router A to Router F)]

If an autonomous migration protocol is careless…


• SWAP : A deadlock-prone autonomous migration protocol

• An eviction swaps the locations of two threads.

• 100 random, synthetic migration patterns.

• 64 threads on 64 cores, migrating every 100 cycles.

• Network-level deadlock-free routing (DOR-XY).

[Figure: results versus number of hotspots (1 to 4) for four configurations: 2 VCs / no buffer, 4 VCs / no buffer, 2 VCs / 4 contexts, and 2 VCs / 8 contexts]

Exclusive Native Context (ENC) protocol

• Key idea: always accept arrived packets.
  - evict a running thread in an unblockable way!

• Migrating threads must not block evicted threads.
  - Separating virtual channel sets is a simple solution.

• Each thread has its own native core, with exclusive space reserved for it.

[Figure: a migration arriving at a core evicts running thread A, which travels to its native core]
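The reason an eviction can never be refused is easiest to see in a toy model. This is a simplified sketch, not the protocol implementation: it assumes one native context and one guest context per core, thread i native to core i, and instantaneous transfers, so it models only context occupancy and not the virtual channels. The class and attribute names are illustrative.

```python
class Core:
    def __init__(self):
        self.native = None  # reserved exclusively for this core's native thread
        self.guest = None   # shared context for a visiting thread

class EncChip:
    """Toy occupancy model: thread i is native to core i (an assumption)."""

    def __init__(self, n):
        self.cores = [Core() for _ in range(n)]
        self.location = {}
        self.evictions = 0
        for i, core in enumerate(self.cores):
            core.native = i
            self.location[i] = i

    def _leave(self, thread):
        core = self.cores[self.location[thread]]
        if core.native == thread:
            core.native = None
        else:
            core.guest = None

    def migrate(self, thread, dest):
        if self.location[thread] == dest:
            return
        self._leave(thread)
        target = self.cores[dest]
        if dest == thread:
            # Going home: the native context can only ever be held by this thread.
            assert target.native is None
            target.native = thread
        else:
            if target.guest is not None:
                # Always accept the arrival: evict the guest to its native core,
                # whose native context is guaranteed to be free.
                evicted = target.guest
                target.guest = None
                assert self.cores[evicted].native is None
                self.cores[evicted].native = evicted
                self.location[evicted] = evicted
                self.evictions += 1
            target.guest = thread
        self.location[thread] = dest

chip = EncChip(4)
chip.migrate(0, 2)   # thread 0 becomes a guest on core 2
chip.migrate(1, 2)   # thread 1 arrives; thread 0 is evicted to core 0
print(chip.location, chip.evictions)
```

In this model an eviction never triggers another eviction, because the evicted thread always lands in a context that no other thread may occupy.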

Application performance results

• Total migration distance: no overhead in real applications

• Completion time: 11.7% overhead of ENC over SWAPinf (on avg.)

[Figures: total migration distance and completion time for RANDOM, FFT, RADIX, LU, OCEAN, and WATER under SWAP, SWAPinf, and ENC]

Physical-level Design:

NoC router implementation for EM2 (IBM SOI 45nm)

EM2 Implementation - Overview

110-core Shared Memory Processor

ISA: EM2 Stack ISA

Shared Memory Architecture: 1. EM2, 2. RA (Remote Access), 3. EM2+RA

Cache: 8KB I$ / 32KB D$ at each core; total 4.4MB on chip; single-cycle read hits, two-cycle write hits

Technology: IBM SOI12SO 45nm

IP: ARM sc12 library (high voltage threshold), IBM SRAM compiler, IBM IO library (wire-bonding), IBM PLL, etc.

NoC router specification for EM2

Channels: Six independent 64-bit channels
  - Remote Access: Request, Response
  - Migration (EM2): Migration, Eviction
  - DRAM Access: Request, Response

Communication: Unicast, in-order

Architectural Performance: 1 cycle/hop

Scheduling Algorithm: Maximal scheduling

Routing: DOR

Network Buffer: Single 4-flit ingress buffer for each port

6 Independent Physical Networks

6-network router with maximal scheduling
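Maximal scheduling means that, once grants are made, no waiting input could still be connected to a free output. The following is a generic greedy sketch of that property, not the EM2 router's actual arbiter: the port names, the fixed priority order, and the function names are assumptions made for illustration.

```python
def maximal_schedule(requests):
    """Greedy crossbar schedule.

    requests maps each input port to the set of output ports it wants.
    Returns a dict input -> granted output in which no output is shared.
    """
    grants = {}
    used_outputs = set()
    for inp in sorted(requests):              # assumed fixed priority order
        for out in sorted(requests[inp]):
            if out not in used_outputs:
                grants[inp] = out
                used_outputs.add(out)
                break
    return grants

def is_maximal(requests, grants):
    """True if every ungranted input only wants outputs that are taken."""
    used = set(grants.values())
    return all(
        inp in grants or all(out in used for out in outs)
        for inp, outs in requests.items()
    )

requests = {"N": {"E"}, "S": {"E", "W"}, "W": {"E"}}
grants = maximal_schedule(requests)
print(grants, is_maximal(requests, grants))
```

A maximal schedule is not necessarily a maximum one, but it can be computed in a single pass, which fits a router that targets 1 cycle per hop.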

Metal Layers Usage

m1, m2, m3: Local logic

c1, c2: Local routing

b1, b2, b3: Remote routing / power grid

ua, ub: Global power grid

lb: Chip IO

Six 64-bit networks need a width of 222um.

Tile Floorplanning

[Figure: tile floorplan for the EM2 tile, showing the router, the 32KB D$, and the 8KB I$]

Placement Results

[Figure: placed EM2 tile, showing the ROUTER, the PREDICTOR, four D$ blocks, two I$ blocks, the D$ tags, and the I$ tags]

EM2 tile

Width: 855um

Height: 917um

RC extracted STA (@typical)

Working frequency: 105MHz

Hold time slack

Power estimation (10% activity)

Chip Floorplanning

Connecting Router Links

Chip-level Clock Tree

Tile-level Clock Tree

EM2 chip

Width 10mm

Height 10mm

~357 Million Transistors

11-by-10 EM2 tile array

[Figure labels: CLK, D-CAPs]

18 man-months

EM2 tile array below the top 2 metal layers
