McRouter: Multicast within a Router for High Performance NoCs. Yuan He, Hiroshi Sasaki*, Shinobu Miwa, Hiroshi Nakamura. The University of Tokyo and *Kyushu University


Page 1: McRouter : Multicast within a Router for High Performance  NoCs


McRouter: Multicast within a Router for High Performance NoCs

Yuan He, Hiroshi Sasaki*, Shinobu Miwa, Hiroshi Nakamura

The University of Tokyo and *Kyushu University

Page 2: McRouter : Multicast within a Router for High Performance  NoCs

Executive Summary

• Like other networks, NoCs are latency-critical. Our evaluations also show that they have plentiful bandwidth within the routers.
• We propose multicasting packets within a router (routing each to all possible outputs), so that route computation is completely hidden and is needed only to acknowledge the ONE correctly routed packet in each multicast
• Results show that
– McRouter makes more productive use of its internal bandwidth
– It outperforms the Prediction Router (the best router so far) with nearly all application traffic we evaluated

Page 3: McRouter : Multicast within a Router for High Performance  NoCs

Outline

• Scope of the Work
• Motivation
• Proposal: Multicast within a Router
• Evaluations and Results
• Conclusion

Page 4: McRouter : Multicast within a Router for High Performance  NoCs

Scope

• On-chip routers
• Standalone router designs
– That is, not based on look-ahead routing
– Conventional Router
– Prediction Router (HPCA 2009, Matsutani et al.)
• Mesh topology
– But the idea should be applicable to other topologies as well

Page 5: McRouter : Multicast within a Router for High Performance  NoCs

Motivation

• Modern On-chip Networks
– Latency critical
  • NoC latency affects cache/memory access latency
– Let us look at two router designs
  • Conventional Router (4-cycle)
  • Prediction Router (1-cycle when prediction succeeds)

Page 6: McRouter : Multicast within a Router for High Performance  NoCs

Conventional Router (CR)

• Conventional Virtual Channel Router
– BW/RC -> VA -> SA -> ST
– Problem -> 4 cycles

[Figure: packet P moving through the router pipeline, stages numbered 1 to 4]

BW: Buffer Write
RC: Route Computation
VA: Virtual Channel Allocation
SA: Switch Allocation
ST: Switch Traversal

Page 7: McRouter : Multicast within a Router for High Performance  NoCs

Prediction Router (PR, Hit)

• Prediction Router (HPCA 2009, Matsutani et al.)
– If the prediction hits (and VA/SA succeed with this predicted RC), only ST is needed (1 cycle)

[Figure: Prediction Router block diagram with inputs 1..n, VCs, route computation, VC allocator, switch allocator, predictors, kill signals, pipeline registers, credits in/out, and outputs 1..n; packet P passes through in 1 cycle]

Page 8: McRouter : Multicast within a Router for High Performance  NoCs

Prediction Router (PR, Miss)

• Prediction Router
– If the prediction misses, misrouted packets get killed and the conventional data path is then used
– Problem -> prediction accuracy is around 65% in our evaluation

[Figure: Prediction Router block diagram with inputs 1..n, VCs, route computation, VC allocator, switch allocator, predictors, kill signals, pipeline registers, credits in/out, and outputs 1..n; packet P shown on a mispredicted path]
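A back-of-the-envelope estimate shows why a 65% hit rate limits the Prediction Router. The sketch below assumes a hit costs 1 cycle and a miss falls back to the full 4-cycle conventional pipeline. The slides do not state the exact miss penalty, so the 4-cycle figure and the function name `expected_router_cycles` are assumptions made for illustration.

```python
def expected_router_cycles(hit_rate: float,
                           hit_cycles: float = 1.0,
                           miss_cycles: float = 4.0) -> float:
    """Average cycles spent in one router for a given prediction hit rate.

    Assumption: a miss costs the full conventional pipeline (4 cycles).
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


# Hit rate of about 65%, as reported on the slide.
pr_estimate = expected_router_cycles(0.65)
```

Under these assumptions the estimate is about 2.05 cycles per router: roughly half of the conventional router's 4 cycles, but still about twice the 1-cycle best case.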

Page 9: McRouter : Multicast within a Router for High Performance  NoCs

Motivation (cont.)

• Modern On-chip Networks
– Bandwidth plentiful
– Observations

Page 10: McRouter : Multicast within a Router for High Performance  NoCs

Observation 1: Average Link Utilization

[Figure: bar chart of average link utilization (flits/link/cycle, axis 0 to 0.05) for barnes, cholesky, fmm, ocean (cont.), ocean (non-cont.), raytrace, volrend, water (nsquared), water (spatial), EP, FT, IS, and LU]

Page 11: McRouter : Multicast within a Router for High Performance  NoCs

Observation 1: Average Link Utilization

• 0.031 flits/link/cycle for the worst case (FT)
– About 0.2 flits/crossbar/cycle, assuming a radix-6 router

Little contention internally
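The crossbar figure appears to follow from multiplying the per-link utilization by the number of router ports. The sketch below reproduces that arithmetic; reading the crossbar load as utilization times radix is an assumption about how the slide's number was derived, and `crossbar_load` is a hypothetical name.

```python
def crossbar_load(link_utilization: float, radix: int) -> float:
    """Flits per cycle offered to one crossbar, summed over all ports.

    Assumption: every port carries the same average link utilization.
    """
    if radix <= 0:
        raise ValueError("radix must be positive")
    return link_utilization * radix


# Worst-case workload (FT) on a radix-6 router, values from the slide.
ft_load = crossbar_load(0.031, 6)  # 0.186, which the slide rounds to 0.2
```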

Page 12: McRouter : Multicast within a Router for High Performance  NoCs

Observation 2: Concurrent Flits to a Router

[Figure: stacked bar chart of the fraction of cycles (axis 80% to 100%) with 0, 1, or >=2 concurrent flits arriving at a router, for barnes, cholesky, fmm, ocean (cont.), ocean (non-cont.), raytrace, volrend, water (nsquared), water (spatial), EP, FT, IS, and LU]

Page 13: McRouter : Multicast within a Router for High Performance  NoCs

Observation 2: Concurrent Flits to a Router

• Taking the worst-case workload (FT)
– 83% of the time -> no incoming flits
– 15% of the time -> 1 flit only
– 2% of the time -> 2+ flits

Very few chances of encountering concurrent flits
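One way to read these numbers: among the cycles in which anything arrives at all, how often is it a single flit? The sketch below derives that share from the three percentages on the slide. The derived ratio is not stated in the slides, and `single_flit_share` is a hypothetical helper.

```python
from typing import Dict

# Fraction of cycles by number of flits arriving at a router (FT, from the slide).
ARRIVALS: Dict[str, float] = {"0": 0.83, "1": 0.15, "2+": 0.02}


def single_flit_share(arrivals: Dict[str, float]) -> float:
    """Share of cycles with at least one arrival in which exactly one flit arrives."""
    busy = arrivals["1"] + arrivals["2+"]
    if busy == 0:
        raise ValueError("no cycles with arriving flits")
    return arrivals["1"] / busy


share = single_flit_share(ARRIVALS)  # 0.15 / 0.17, a little over 88%
```

So even for FT, the large majority of arrivals come alone, which is the situation a per-router multicast can exploit.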

Page 14: McRouter : Multicast within a Router for High Performance  NoCs

Proposal: Multicast within a Router

• Or McRouter for short
– A single-cycle router when there is enough bandwidth
– Based on a multicast operation inside the router
– A multicast is like an always-correct prediction
  • No predictors

[Figure: side-by-side comparison of Conventional Router, Prediction Router, and McRouter]

Page 15: McRouter : Multicast within a Router for High Performance  NoCs

McRouter: Conditions to Invoke a Multicast

[Figure: McRouter block diagram with inputs 1..n, VCs, route computation, VC allocator, switch allocator, credits in/out, outputs 1..n, and a Multicast Unit exchanging Valid VCID 1..n and ACK 1..n signals; packet P arriving at an input]

1) Only 1 flit arrives at the router (which means no concurrent flits)
2) Within this router, no flit is waiting to undertake ST (switch traversal)
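The two conditions can be written as a single predicate. This is a minimal sketch: the function and parameter names are hypothetical, and a real router would evaluate this in hardware each cycle rather than in software.

```python
def can_multicast(arriving_flits: int, flits_waiting_for_st: int) -> bool:
    """Return True when both conditions for invoking a multicast hold.

    1) exactly one flit arrives at the router this cycle, and
    2) no flit already inside the router is waiting for switch traversal.
    """
    return arriving_flits == 1 and flits_waiting_for_st == 0
```

For example, `can_multicast(1, 0)` holds, while two concurrent arrivals (`can_multicast(2, 0)`) or a pending switch traversal (`can_multicast(1, 1)`) would leave the flit on the conventional path.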

Page 16: McRouter : Multicast within a Router for High Performance  NoCs

Multicasting Operation

[Figure: McRouter block diagram with inputs 1..n, VCs, route computation, VC allocator, switch allocator, credits in/out, outputs 1..n, and the Multicast Unit with Valid VCID 1..n and ACK 1..n signals; packet P shown at several positions in the router]
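A software sketch of one multicast may help: the flit is copied toward every candidate output while route computation runs in parallel, and only the copy on the correct output is acknowledged. The port names, the XY routing function, the no-U-turn rule, and the "ack"/"kill" labels are all assumptions made for illustration. The sketch models a generic 5-port mesh router, not the exact radix-6 router evaluated in the slides.

```python
from typing import Dict, Tuple

PORTS = ["north", "east", "south", "west", "local"]  # assumed 5-port mesh router
Coord = Tuple[int, int]


def xy_route(cur: Coord, dst: Coord) -> str:
    """Dimension-order (XY) route computation; an assumed routing function."""
    (cx, cy), (dx, dy) = cur, dst
    if dx > cx:
        return "east"
    if dx < cx:
        return "west"
    if dy > cy:
        return "north"
    if dy < cy:
        return "south"
    return "local"


def multicast_cycle(cur: Coord, dst: Coord, in_port: str) -> Dict[str, str]:
    """Copy the flit to all candidate outputs, then ACK the one correct copy.

    Route computation is off the critical path: its result is used only to
    decide which copy is acknowledged and which copies are killed.
    """
    candidates = [p for p in PORTS if p != in_port]  # no U-turns (assumption)
    correct = xy_route(cur, dst)
    return {p: ("ack" if p == correct else "kill") for p in candidates}


# A flit entering router (1, 1) from the west, destined for router (3, 1).
outcome = multicast_cycle((1, 1), (3, 1), "west")
```

In this example only the copy on the east output is acknowledged; the copies on the other three outputs are killed, which is the extra cost the multicast trades for hiding route computation.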

Page 17: McRouter : Multicast within a Router for High Performance  NoCs

A Summary on McRouter

• Pros
– A single-cycle router when internal bandwidth allows
– No predictors
• Cons
– More complex control over the crossbar switch
– Killing of more misrouted flits

Page 18: McRouter : Multicast within a Router for High Performance  NoCs

Evaluation Methodology

• CPU Model: Simics 3.0.31
– 16 cores, in-order
• Memory Model: GEMS 2.1.1
– 32KB L1 I/D caches
– 256KB L2 cache x 16 banks
– 4 memory controllers, 4GB main memory
• NoC Model: GARNET
– 4 x 4 mesh with virtual channel routers
• NoC Power Model: Orion 2
– 32nm process and 1V Vdd
• Synthetic Traffic: Uniform Random
• Benchmarks: 13 workloads
– From SPLASH-2 and NPB-3
• Counterparts: CR and PR

[Figure: mesh floorplan showing routers, links, cores with L1 caches, L2 cache banks, and memory controllers]

Page 19: McRouter : Multicast within a Router for High Performance  NoCs

Evaluations with Synthetic Traffic

[Figure: per-flit latency (cycles, axis 30 to 55) versus injection rate (flits/node/cycle, 0.025 to 0.305) for Conventional Router, Prediction Router (LPM), Prediction Router (FCM), and McRouter; annotations mark 0.07 flits/link/cycle and 0.34 flits/link/cycle]

Page 20: McRouter : Multicast within a Router for High Performance  NoCs

Evaluations with Application Traffic: Normalized System Speed-up

[Figure: bar chart of normalized system speed-up (axis 0.9 to 1.5) for Conventional Router, Prediction Router (LPM), Prediction Router (FCM), and McRouter across barnes, cholesky, fmm, ocean (cont.), ocean (non-cont.), raytrace, volrend, water (nsquared), water (spatial), EP, FT, IS, LU, and the geometric mean]

Page 21: McRouter : Multicast within a Router for High Performance  NoCs

Sensitivity Study with Network Parameter Downscaling

[Figure: two bar charts of normalized system speed-up for CR, PR(LPM), PR(FCM), and McRouter under three configurations (128-bit, 4 VCs; 64-bit, 4 VCs; 128-bit, 1 VC); workloads: raytrace and FT]

• Parameters downscaled
– Link width halved
– # of VCs minimized
• McRouter still works with thinned bandwidth
– Its advantage over CR/PR does not come from over-designing

Page 22: McRouter : Multicast within a Router for High Performance  NoCs

Conclusion

• A new low-latency router
– It successfully hides route computation and arbitration delays while still being a standalone design
– It outperforms PR (the best router so far) in practice
– We uncover an insight: with more aggressive utilization of the remaining internal bandwidth, a router's latency can be shortened dramatically with simple architectural changes

Page 23: McRouter : Multicast within a Router for High Performance  NoCs

Thank you so much for your attention!