DUKE UNIVERSITY
Self-Tuned Congestion Control for Multiprocessor Networks
Shubhendu S. Mukherjee, [email protected]
VSSAD, Alpha Development Group, Compaq Computer Corporation
Shrewsbury, Massachusetts
Mithuna Thottethodi, Alvin R. Lebeck, {mithuna,alvy}@cs.duke.edu, Department of Computer Science, Duke University, Durham, North Carolina
Appeared in the 7th International Symposium on High-Performance Computer Architecture (HPCA), Monterrey, Mexico, January, 2001
Slide 2
Network Saturation
[Figure: normalized accepted throughput (flits/node/cycle) vs. packet regeneration interval (cycles), for Butterfly and Random traffic]
Slide 3
Why Network Saturation?
[Figure: packets backing up through a router]
• Tree saturation
• Deadlock cycles
• New packets block older packets
• Backpressure takes 1000s of cycles to propagate back
Slide 4
Why Do We Care?
Computation power per router is increasing
 More aggressive speculation
 Simultaneous Multithreading
 Chip Multiprocessors
“Unstable” behavior makes designers very nervous
[Figure: several CPUs sharing one router]
Slide 5
So, what’s the solution?
Throttle
 stop injecting packets when you hit a “threshold”
 “threshold” = % full network buffers
But
 local estimate of threshold insufficient
 saturation point differs for communication patterns
Questions
 How do we collect global estimate of % full network buffers?
 How do we “tune” the threshold to different patterns?
Slide 6
Outline
Overview
Multiprocessor Network Basics
 Deadlocks & virtual channels
 Adaptive routing & Duato’s theory
How to collect global estimate of congestion?
How to “tune” the throttle threshold?
Methodology & Results
Summary, Future Work, & Other Projects
Slide 7
A Multiprocessor Network
[Figure: grid of nodes connected by routers]
Slide 8
Deadlock Avoidance
[Figure: left — four packets (1–4) holding and requesting buffers around a ring of routers, labeled “Deadlocked”; right — the same traffic split across virtual channels (red & yellow), breaking the cycle]
Slide 9
Virtual Channels (VC)
[Figure: ring of routers, one buffer per VC]
One buffer per VC
Logically, red and yellow networks (deadlock-free)
Slide 10
Duato’s Theory
Adaptive network for high performance
 deadlock-prone
Deadlock-free network when adaptive network deadlocks
 drop down to deadlock-free when router is congested
Implemented with different virtual channels
 adaptive virtual channels
 deadlock-free virtual channels (escape channels)
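As a concrete illustration, Duato-style channel selection can be sketched as below. The function name and inputs are hypothetical, not from the paper: a packet first tries any free adaptive VC on a profitable output, and only falls back to the deadlock-free escape VC when none is free.

```python
def select_vc(adaptive_vcs_free, escape_vc_free):
    """Sketch of Duato-style VC selection at one router.

    adaptive_vcs_free: list of bools, one per adaptive VC on
    profitable (minimal) outputs; escape_vc_free: bool.
    Returns (channel class, VC index) or None to block and wait.
    """
    for vc, free in enumerate(adaptive_vcs_free):
        if free:
            return ("adaptive", vc)   # fully adaptive routing
    if escape_vc_free:
        return ("escape", 0)          # deadlock-free fallback channel
    return None                       # all busy: stall at this router

print(select_vc([False, True], True))   # adaptive VC 1 is free
print(select_vc([False, False], True))  # forced onto the escape channel
```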
Slide 11
Outline
Overview
Multiprocessor Network Basics
How to collect global estimate of congestion?
How to “tune” the throttle threshold?
Methodology & Results
Summary, Future Work, & Other Projects
Slide 12
Global Estimate of Congestion
% of full buffers in entire network
 more & more buffers occupied when network saturates
 throttle network when % full buffers cross threshold
Advantages
 simple aggregation
 empirical observation: works well
Disadvantages
 doesn’t detect localized congestion
 threshold differs for communication patterns (we solve this)
Slide 13
Gather Global Information
Global Information
 % full network buffers in an “interval”
 % packets or flits delivered during an “interval”
Constraint
 gather time << backpressure buildup time (1000s of cycles)
Mechanisms
 piggybacking
 meta-packets
 side-band signal
Slide 14
Sideband: Dimension-wise Aggregation
Each hop takes h cycles on the sideband
After 2 hops, aggregation in one dimension done
2 such phases
Total gather time = 2 * 2 * h = 4h cycles
For k-ary n-cubes, gather-time (g) = n * k * h / 2
For a 16x16 network, g = 2 * 16 * 2 / 2 = 32 cycles
[Figure: Phase 1 and Phase 2 of dimension-wise aggregation]
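The gather-time formula can be checked with a small sketch (the function name is illustrative): k/2 hops per dimension on the torus, n sequential phases, h cycles per sideband hop.

```python
def gather_time(n, k, h):
    """Cycles to aggregate congestion info over a k-ary n-cube torus
    via dimension-wise sideband aggregation: k/2 hops per dimension,
    n sequential phases, h cycles per hop -> g = n * k * h / 2."""
    return n * k * h // 2

# 16x16 torus (16-ary 2-cube), h = 2 cycles per sideband hop:
print(gather_time(2, 16, 2))  # -> 32
```

The same formula yields g = 48 and g = 96 for the h = 3 and h = 6 cases on slide 24.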
Slide 15
Outline
Overview
Multiprocessor Network Basics
How to collect global estimate of congestion?
How to “tune” the throttle threshold?
Methodology & Results
Summary, Future Work, & Other Projects
Slide 16
Dynamic Detection of Threshold (Hill Climbing)
[Figure: throughput vs. % full buffers, with operating points A, B, and C around the peak; threshold marked]

                           Currently throttling?
 Drop in bandwidth > 25%?    Yes          No
 No                          Increment    No change
 Yes                         Decrement    Decrement
… we may still creep into saturation (later)
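The adjustment rule can be sketched as a single update step. The constants follow the slides (25% drop trigger; the 1% increment and 4% decrement appear on slide 22); the function name and fractional-threshold representation are illustrative.

```python
def adjust_threshold(threshold, throttling, bw_drop_frac,
                     inc=0.01, dec=0.04):
    """One hill-climbing step on the throttling threshold
    (threshold = fraction of full network buffers)."""
    if bw_drop_frac > 0.25:
        return threshold - dec   # big bandwidth drop: back off
    if throttling:
        return threshold + inc   # throttling but stable: probe higher
    return threshold             # not throttling, no drop: no change
```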
Slide 17
Summary of Approach
Global Knowledge of a Network
 Collect % full network buffers and overall throughput
 Dimension-wise aggregation, g-cycle snapshots
 Aggregation via sideband signals
Dynamically detect throttling threshold
 Threshold = % of full network buffers
 Self-tuned using hill climbing
 Reset if hill climbing fails
Slide 18
Outline
Overview
Multiprocessor Network Basics
How to collect global estimate of congestion?
How to “tune” the throttle threshold?
Methodology & Results
Summary, Future Work, & Other Projects
Slide 19
Methodology
Flitsim 2.0 Simulator (Pinkston’s group at USC)
 warmup for 10k cycles, simulate for 50k cycles
Network architecture
 16x16 two-dimensional torus (16-ary 2-cube)
 Full-duplex links
 Packet size = 16 flits
 Wormhole routing
 Deadlock avoidance (paper has deadlock recovery results)
Router architecture
 3 virtual channels per physical channel
 Each virtual channel buffer holds 8 flits
 1 cycle central arbitration, 1 cycle switching
Slide 20
Input Traffic
Packet Generation Frequency
 “attempt” to send one packet per packet regeneration interval
Traffic Patterns (node address a_{n-1} a_{n-2} … a_1 a_0)
 Random destination
 Perfect Shuffle: a_{n-1} a_{n-2} … a_1 a_0 → a_{n-2} a_{n-3} … a_0 a_{n-1}
 Butterfly: a_{n-1} a_{n-2} … a_1 a_0 → a_0 a_{n-2} … a_1 a_{n-1}
 Bit Reversal: a_{n-1} a_{n-2} … a_1 a_0 → a_0 a_1 … a_{n-2} a_{n-1}
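These permutations can be expressed as bit manipulations on an n-bit node address; a minimal sketch (function names are illustrative):

```python
def shuffle(a, n):
    """Perfect shuffle: rotate the n-bit address left by one bit."""
    msb = (a >> (n - 1)) & 1
    return ((a << 1) | msb) & ((1 << n) - 1)

def butterfly(a, n):
    """Butterfly: swap the most- and least-significant bits."""
    msb, lsb = (a >> (n - 1)) & 1, a & 1
    a &= ~((1 << (n - 1)) | 1)          # clear both end bits
    return a | (lsb << (n - 1)) | msb

def bit_reversal(a, n):
    """Bit reversal: reverse all n address bits."""
    r = 0
    for i in range(n):
        r = (r << 1) | ((a >> i) & 1)
    return r

print(bin(shuffle(0b10000000, 8)))      # -> 0b1
print(bin(bit_reversal(0b00000001, 8))) # -> 0b10000000
```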
Slide 21
Throttling Algorithms
Base
 no throttling
ALO (At Least One)
 Lopez, Martinez, and Duato, ICPP, August 1998
 Throttling based on local estimation of congestion
 Inject new packet only if
  – “useful” physical channel has all virtual channels free, or
  – at least one virtual channel on every “useful” channel is free
Tune (this work)
Slide 22
Tuning Parameters
Total number of network buffers = 256 * 3 * 4 = 3072
Gather time (g) = n * k * h / 2 = 32 cycles
Sideband communication latency (h) = 2 cycles
Sideband communication bandwidth = 25 bits (!)
 # network buffers = 3072 = 12 bits
 max throughput = g * 256 * 1 = 8192 = 13 bits
Tuning frequency = once every 96 cycles
Initial threshold value = 1% ~= 30 buffers
Threshold increment = 1%, decrement = 4%
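The 25-bit sideband budget can be reproduced as follows, counting bits as ceil(log2(count)) the way the slide does; the names are illustrative.

```python
import math

def bits_needed(count):
    """Bits to encode a count, as tallied on the slide."""
    return math.ceil(math.log2(count))

buffers = 256 * 3 * 4    # nodes * VCs * physical channels = 3072
g = 2 * 16 * 2 // 2      # gather interval = 32 cycles
max_flits = g * 256 * 1  # at most 1 flit/node/cycle over g cycles = 8192

print(bits_needed(buffers))                          # -> 12
print(bits_needed(max_flits))                        # -> 13
print(bits_needed(buffers) + bits_needed(max_flits)) # -> 25
```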
Slide 23
Random Pattern
[Figure: left — normalized accepted throughput (flits/node/cycle) vs. packet regeneration interval (cycles); right — average latency (cycles, log scale 10–1000) vs. packet regeneration interval; series: Tune, Base, ALO]
Beyond saturation point, Tune outperforms ALO and Base
Slide 24
Delayed Collection of Global Knowledge (h = 2, 3, 6 cycles)
[Figure: left — normalized accepted throughput (flits/node/cycle) vs. packet regeneration interval (cycles); right — average latency (cycles, log scale 10–1000) vs. packet regeneration interval; series: g=32 (h=2), g=48 (h=3), g=96 (h=6)]
Tune fairly insensitive to delayed collection of information
Slide 25
Static Threshold Choice
[Figure: accepted throughput (flits/node/cycle) vs. packet regeneration interval (cycles) for Uniform Random and Butterfly; series: Static Threshold = 250, Static Threshold = 50, Tune, Base]
Optimal thresholds differ for random and butterfly. Tune performs close to the best static threshold.
Slide 26
With Bursty Load Tune outperforms ALO
[Figure: normalized throughput (flits/node/cycle) vs. time (10,000–60,000 cycles) as the pattern shifts from random to bit reversal to shuffle to butterfly; series: Tune, Base, ALO]
Slide 27
Avoiding Local Maxima
What if steady decrease in bandwidth < 25%?
 potential to “creep” into saturation
Solution: remember global maximum
 max = maximum throughput seen in any tuning period
 N_max = number of full buffers at max
 T_max = threshold at max
 Reset threshold to min(T_max, N_max) if throughput < 50% of max
If “r” consecutive resets don’t fix the problem, then restart
 hypothesis: communication pattern has changed
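The reset rule can be sketched as a little state kept alongside hill climbing. The class and field names, and the default value of r, are hypothetical; the 50% trigger and min(T_max, N_max) reset follow the slide.

```python
class SelfTuner:
    """Sketch of the saturation-escape state layered on hill climbing."""

    def __init__(self, r=3):   # r consecutive resets before full restart
        self.max_tput = 0.0    # best throughput seen in any tuning period
        self.n_max = 0         # full buffers observed at max_tput
        self.t_max = 0         # threshold in force at max_tput
        self.resets = 0
        self.r = r

    def observe(self, tput, full_buffers, threshold):
        """Returns the threshold to use next, or "restart"."""
        if tput > self.max_tput:                  # new global maximum
            self.max_tput, self.n_max, self.t_max = tput, full_buffers, threshold
            self.resets = 0
        if tput < 0.5 * self.max_tput:            # crept into saturation
            self.resets += 1
            if self.resets >= self.r:             # pattern likely changed
                self.__init__(self.r)             # forget history, restart
                return "restart"
            return min(self.t_max, self.n_max)    # reset the threshold
        return threshold                          # healthy: keep tuning
```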
Slide 28
Threshold Reset Necessary
Packet Regeneration Interval = 10 cycles
[Figure: left — accepted throughput (flits/node/cycle) vs. time (10,000–60,000 cycles); right — threshold (0–1600) vs. time; series: Hill Climbing, Hill Climbing + Local Maxima]
Slide 29
Summary
Network saturation is a severe problem
 advent of powerful processors, SMT, and CMPs
 “unstable” behavior makes designers nervous
We propose throttling based on global knowledge
 aggregate global knowledge (% full buffers, throughput)
 throttle when % full buffers exceed threshold
 tune threshold for communication patterns & offered load