McRouter : Multicast within a Router for High Performance NoCs. Yuan He , Hiroshi Sasaki*, Shinobu Miwa, Hiroshi Nakamura The University of Tokyo and *Kyushu University. Executive Summary. - PowerPoint PPT Presentation
McRouter: Multicast within a Router for High Performance NoCs
Yuan He, Hiroshi Sasaki*, Shinobu Miwa, Hiroshi Nakamura
The University of Tokyo and *Kyushu University
Executive Summary
• Like other networks, NoCs are latency critical. But through evaluations, we also observed that they can have plentiful bandwidth (within the routers)
• We propose to have packets multicast within a router (routed to all possible outputs), so that route computation is completely hidden and is needed only to acknowledge the ONE correctly routed packet of a multicast
• Results show that
  – McRouter makes more productive use of its internal bandwidth
  – It outperforms the Prediction Router (the best router so far) with nearly all application traffic we evaluated
Outline
• Scope of the Work
• Motivation
• Proposal: Multicast within a Router
• Evaluations and Results
• Conclusion
Scope
• On-chip routers
• Standalone router designs
  – So not based on look-ahead routing
  – Conventional Router
  – Prediction Router (HPCA 2009, Matsutani et al.)
• Mesh topology
  – But the idea should be applicable to other topologies as well
Motivation
• Modern On-chip Networks
  – Latency Critical
    • NoC latency affects cache/memory access latency
  – Let us look at two router designs
    • Conventional Router (4-cycle)
    • Prediction Router (1-cycle when prediction succeeds)
Conventional Router (CR)
• Conventional Virtual Channel Router
  – BW/RC -> VA -> SA -> ST
    • Problem -> 4 cycles
[Figure: conventional router pipeline, cycles 1 to 4]
BW: Buffer Write
RC: Route Computation
VA: Virtual Channel Allocation
SA: Switch Allocation
ST: Switch Traversal
Prediction Router (PR, Hit)
• Prediction Router (HPCA 2009, Matsutani et al.)
  – If prediction hits (and VA/SA succeeds with this predicted RC), only ST is needed (1-cycle)
[Figure: Prediction Router datapath on a prediction hit (1 cycle). Blocks: Inputs 1..n, VCs, Predictor(s), Route Computation, VC Allocator, Switch Allocator, Pipeline Registers, Kill Signals, Credits In/Out, Outputs 1..n]
Prediction Router (PR, Miss)
• Prediction Router
  – If prediction misses, misrouted packets get killed and the conventional data path is then used
  – Problem -> prediction accuracy is around 65% in our evaluation
[Figure: Prediction Router datapath on a prediction miss. Blocks: Inputs 1..n, VCs, Predictor(s), Route Computation, VC Allocator, Switch Allocator, Pipeline Registers, Kill Signals, Credits In/Out, Outputs 1..n]
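To see what a 65% hit rate means for latency, here is a rough back-of-the-envelope sketch. It assumes a hit costs 1 cycle and a miss costs exactly the 4 cycles of the conventional path; the slides do not give the miss penalty, so that figure is an assumption, and the sketch ignores contention and VA/SA failures. The function name is hypothetical.

```python
def expected_router_latency(hit_rate, hit_cycles=1, miss_cycles=4):
    """Average per-hop router latency of a predictive router.

    Assumption (not from the slides): a miss costs exactly `miss_cycles`,
    i.e. the full conventional pipeline, with no extra kill penalty.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be in [0, 1]")
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


pr_latency = expected_router_latency(0.65)
print(f"{pr_latency:.2f} cycles per hop")  # 2.05 cycles per hop
```

Under these assumptions the average is about 2 cycles per hop, roughly twice the 1-cycle best case, which is the gap an always-correct scheme would try to close.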
Motivation (cont…)
• Modern On-chip Networks
  – Bandwidth Plentiful
  – Observations
Observation 1: Average Link Utilization
[Figure: bar chart of average link utilization (flits/link/cycle, y-axis 0 to 0.05) for barnes, cholesky, fmm, ocean (cont.), ocean (non-cont.), raytrace, volrend, water (nsquared), water (spatial), EP, FT, IS, LU]
• 0.031 flits/link/cycle for the worst case (FT)
  – About 0.2 flits/crossbar/cycle, assuming a radix-6 router
Little contention internally
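The crossbar figure follows from simple arithmetic, sketched here. It assumes every one of the six ports carries the average link load, which is a simplification; the function name is hypothetical.

```python
def crossbar_load(link_utilization, radix):
    """Flits entering the crossbar per cycle if each of `radix` ports
    carries the average link load (a simplifying assumption)."""
    return link_utilization * radix


load = crossbar_load(0.031, 6)  # worst-case workload (FT), radix-6 router
print(f"{load:.3f} flits/crossbar/cycle")  # 0.186, i.e. roughly 0.2
```

A crossbar that sees about one flit every five cycles is idle most of the time, which is the spare internal bandwidth the proposal relies on.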
Observation 2: Concurrent Flits to a Router
[Figure: stacked bar chart of the fraction of cycles with 0, 1, or >=2 concurrent flits arriving at a router (y-axis 80% to 100%) for barnes, cholesky, fmm, ocean (cont.), ocean (non-cont.), raytrace, volrend, water (nsquared), water (spatial), EP, FT, IS, LU]
• Taking the worst-case workload (FT)
  – 83% of the time -> no incoming flits
  – 15% of the time -> 1 flit only
  – 2% of the time -> 2+ flits
Very few chances of encountering concurrent flits
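Another way to read these numbers is as a share of busy cycles only. This is a small sketch using the FT fractions above; the variable names are illustrative.

```python
# Fractions of cycles for the worst-case workload (FT), from the slide.
fractions = {"0 flits": 0.83, "1 flit": 0.15, "2+ flits": 0.02}

# Cycles in which at least one flit arrives at the router.
busy = fractions["1 flit"] + fractions["2+ flits"]

# Share of those busy cycles in which the arriving flit is alone.
single_share = fractions["1 flit"] / busy
print(f"{single_share:.0%} of busy cycles carry a single flit")  # 88%
```

So even for FT, about 88% of the cycles with any arrival involve a lone flit. This is only an upper bound on how often a single-flit fast path could apply, since other conditions (such as flits already buffered in the router) are not captured here.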
Proposal: Multicast within a Router
• Or McRouter for short
  – Single-cycle router when there is enough bandwidth
  – Is based on multicast operation inside a router
  – A multicast is like an always-correct prediction
    • No predictors
[Figure: side-by-side comparison of Conventional Router, Prediction Router, and McRouter]
McRouter: Conditions to Invoke a Multicast
[Figure: McRouter datapath. Blocks: Inputs 1..n, VCs, Route Computation, VC Allocator, Switch Allocator, Multicast Unit, Valid VCID 1..n, ACK 1..n, Credits In/Out, Outputs 1..n]
1) Only 1 flit arrives at the router (which means no concurrent flits)
2) Within this router, no flit is waiting to undertake ST (switch traversal)
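The two conditions can be expressed as a simple predicate. This is a minimal sketch, not the hardware logic from the paper: `RouterState` and its field names are hypothetical and only model the per-cycle counts the conditions refer to.

```python
from dataclasses import dataclass


@dataclass
class RouterState:
    """Hypothetical per-cycle snapshot of one router."""
    arriving_flits: int        # flits arriving at the router this cycle
    flits_waiting_for_st: int  # buffered flits waiting for switch traversal


def can_multicast(state):
    """True only when both conditions for invoking a multicast hold."""
    exactly_one_arrival = state.arriving_flits == 1   # condition 1
    crossbar_idle = state.flits_waiting_for_st == 0   # condition 2
    return exactly_one_arrival and crossbar_idle


print(can_multicast(RouterState(arriving_flits=1, flits_waiting_for_st=0)))  # True
print(can_multicast(RouterState(arriving_flits=2, flits_waiting_for_st=0)))  # False
```

When the predicate is false, the flit would simply take the conventional pipeline.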
Multicasting Operation
[Figure: multicasting operation in the McRouter datapath; copies of the packet (P) are sent toward the outputs. Blocks: Inputs 1..n, VCs, Route Computation, VC Allocator, Switch Allocator, Multicast Unit, Valid VCID 1..n, ACK 1..n, Credits In/Out, Outputs 1..n]
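The operation can be modelled in software roughly as follows: copy the flit to every candidate output, compute the route alongside, then acknowledge the one correct copy and kill the rest. Several details here are assumptions rather than facts from the slides: the five port names, the exclusion of the input port from the candidates, and dimension-order (XY) routing as the route computation.

```python
PORTS = ["north", "east", "south", "west", "local"]  # hypothetical port names


def xy_route(current, destination):
    """Dimension-order routing on a mesh (assumed, not stated in the slides)."""
    (cx, cy), (dx, dy) = current, destination
    if dx > cx:
        return "east"
    if dx < cx:
        return "west"
    if dy > cy:
        return "north"
    if dy < cy:
        return "south"
    return "local"


def multicast(current, destination, input_port):
    """Return (acknowledged output, list of outputs whose copies are killed)."""
    # Copies go to all candidate outputs; the input port is excluded (assumption).
    candidates = [p for p in PORTS if p != input_port]
    # In hardware this would overlap with switch traversal; here it runs in sequence.
    correct = xy_route(current, destination)
    acked = correct if correct in candidates else None
    killed = [p for p in candidates if p != correct]
    return acked, killed


acked, killed = multicast(current=(1, 1), destination=(3, 1), input_port="west")
print(acked, killed)  # east ['north', 'south', 'local']
```

The sketch also makes the cost visible: every multicast produces several copies that must be killed.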
A Summary on McRouter
• Pros
  – A single-cycle router when internal bandwidth allows
  – No predictors
• Cons
  – More complex control over the crossbar switch
  – Killing of more misrouted flits
Evaluation Methodology
• CPU Model: Simics 3.0.31
  – 16 cores, in-order
• Memory Model: GEMS 2.1.1
  – 32KB L1 I/D Caches
  – 256KB L2 Cache x 16 Banks
  – 4 Memory Controllers, 4GB main memory
• NoC Model: GARNET
  – 4 x 4 Mesh with virtual channel routers
• NoC Power Model: Orion 2
  – 32nm process and 1V Vdd
• Synthetic Traffic: Uniform Random
• Benchmarks: 13 workloads
  – From SPLASH-2 and NPB-3
• Counterparts: CR and PR
[Figure: system layout showing routers, links, cores/L1 caches, L2 cache banks, and memory controllers]
Evaluations with Synthetic Traffic
[Figure: per-flit latency (cycles, y-axis 30 to 55) vs. injection rate (0.025 to 0.305 flits/node/cycle) for Conventional Router, Prediction Router (LPM), Prediction Router (FCM), and McRouter; annotations at 0.07 flits/link/cycle and 0.34 flits/link/cycle]
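For readers unfamiliar with the workload, uniform random synthetic traffic can be generated along these lines. This is a generic sketch, not the GARNET implementation: each node injects with a fixed probability per cycle and picks a destination uniformly among the other nodes. Treating every injection as a single flit is an assumption, as are the function and parameter names.

```python
import random


def uniform_random_traffic(num_nodes, injection_rate, cycles, seed=0):
    """Return a list of (cycle, source, destination) injections."""
    rng = random.Random(seed)  # seeded for repeatable traces
    packets = []
    for cycle in range(cycles):
        for src in range(num_nodes):
            # Each node injects with probability `injection_rate` per cycle.
            if rng.random() < injection_rate:
                # Draw from the other num_nodes - 1 nodes, skipping the source.
                dst = rng.randrange(num_nodes - 1)
                if dst >= src:
                    dst += 1
                packets.append((cycle, src, dst))
    return packets


# 16 nodes as in the 4 x 4 mesh; 0.1 is one point inside the swept range.
trace = uniform_random_traffic(num_nodes=16, injection_rate=0.1, cycles=1000)
print(len(trace), "injections over 1000 cycles")
```

Sweeping `injection_rate` and measuring latency on the resulting trace is how a latency-versus-load curve of this kind is usually produced.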
Evaluations with Application Traffic: Normalized System Speed-up
[Figure: normalized system speed-up (y-axis 0.9 to 1.5) for barnes, cholesky, fmm, ocean (cont.), ocean (non-cont.), raytrace, volrend, water (nsquared), water (spatial), EP, FT, IS, LU, and their geometric mean; series: Conventional Router, Prediction Router (LPM), Prediction Router (FCM), McRouter]
Sensitivity Study with Network Parameter Downscaling
[Figure: normalized system speed-up of CR, PR (LPM), PR (FCM), and McRouter under three configurations (128-bit, 4 VCs; 64-bit, 4 VCs; 128-bit, 1 VC); two panels, Workload: raytrace and Workload: FT]
• Parameters downscaled
  – Link width halved
  – # of VCs minimized
• McRouter still works with thinned bandwidth
  – Its advantages over CR/PR do not come from over-designing
Conclusion
• A new low-latency router
  – It successfully hides route computation and arbitration delays while still being a standalone design
  – It outperforms PR (best router so far) in practice
  – We uncover an insight that with more aggressive utilization of remaining internal bandwidth, a router can have its latency dramatically shortened with simple architectural changes
Thank you so much for your attention!