23
Case study: InfiniBand CM Parameter Tuning Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE 802 Plenary Session Dallas, Nov. 2006

Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

Case study: InfiniBand CM Parameter Tuning

Mitch Gusat and Wolfgang Denzelwith IBM team

IBM ZRL

IEEE 802 Plenary SessionDallas, Nov. 2006

Page 2: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 2

Outline

• Problem: Congestion in IBA Networks

• Solution (elements thereof): CCA

• Need: Tune the CCA parameters

• Method: ZRL Congestion Benchmarking

• Simulation results and conclusions

Page 3: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 3

Problem: Hotspot Congestion in Lossless ICTNs => Saturation Trees

shortcut routes within same switch elements (SE)

same SEs

port 128

port 4

port 1HCA

HCA

. .

. .

.

HCA

HCA

. .

. .

.

16 8x8 SEs

. .

..

. .

. .

.

32 4x4 half SEs

.

.

.

.

32 4x4 half SEs

.

.

.

.

32 4x4 half SEs

.

.

.

.

32 4x4 half SEs

.

.

.

.

SourceHost

ChannelAdapters

DestinationHost

ChannelAdapters

“Inn

ocen

t”tr

affic

+ 1

Hot

spot

hotspot

Page 4: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

Zurich Research Laboratory

Congestion Control © 2005 IBM Corporation

Effect: Global collapse of throughput caused by single hotspot

shortcut routes within same switch elements (SE)

same SEs

port 128

port 4

port 1HCA

HCA

. .

. .

.

HCA

HCA

. .

. .

.

16 8x8 SEs

. .

..

. .

. .

.

32 4x4 half SEs

.

.

.

.

32 4x4 half SEs

.

.

.

.

32 4x4 half SEs

.

.

.

.

32 4x4 half SEs

.

.

.

.

SourceHost

ChannelAdapters

DestinationHost

ChannelAdapters

Too

muc

h tr

affic

to h

otsp

ot

hotspot

0

0.2

0.4

0.6

0.8

1

0 2ms 4ms 6ms 8ms 10ms

Thro

ughp

ut (

for e

ach

outp

ut)

Time

congestion start congestion end

300% hotspot oversubscriptionduring congestion period

all time constant offered load = 0.9

throughput of hotspot output

saturates

throughput of other outputsbreaks down

Page 5: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

Zurich Research Laboratory

Congestion Control © 2005 IBM Corporation

Solution (elements thereof): IBA Congestion Control Annex (CCA)

Source HCASwitch

congestionthresholdIndex

FECNDestination

HCA

Congestion Control Tablecontaining inter-packet delays (IPD)

IPD−

+

Timer

BECNBECN

IPD

packet packet

Forward ExplicitCongestion Notification

Backward ExplicitCongestion NotificationParameters

Switch queue threshold (~Qeq): sw_th– congestion detection and FECN marking

IPD table size = 64-256 entries – dynamic range of per flow rate control

highest IPD entry: max_ipd – largest inter-packet gap = 1/Rmin

IPD index increment: ipd_idx_incr– step on each new BECN => rate control granularity

IPD recovery timer: rec_time– timeout of the autonomic rate increase timer

Page 6: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 6

Questions

1. Does it work?

2. Tuning: What parameter settings for which cases?=> exploration of a combinatorial sub-space

3. Where are the limits of its operation?

Page 7: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 7

Does it work?

• Qualified “yes” => needs tuningeasy for small fabrics w/ simple traffic, hard for others...

• Param tuning required per (1) fabric architecture and (2) traffic pattern

• Fragile stability (stiffness): CCA sensitivity to traffic and params...

0

0.2

0.4

0.6

0.8

1

1.2

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

output[31]output[30]output[29]output[28]output[27]output[26]output[25]output[24]output[23]output[22]output[21]output[20]output[19]output[18]output[17]output[16]output[15]output[14]output[13]output[12]output[11]output[10]output[9]output[8]output[7]output[6]output[5]output[4]output[3]output[2]output[1]output[0]

hotspot

0

0.2

0.4

0.6

0.8

1

1.2

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

output[31]output[30]output[29]output[28]output[27]output[26]output[25]output[24]output[23]output[22]output[21]output[20]output[19]output[18]output[17]output[16]output[15]output[14]output[13]output[12]output[11]output[10]output[9]output[8]output[7]output[6]output[5]output[4]output[3]output[2]output[1]output[0]

hotspot

Tuned parameters (almost..)Un-tuned congestion control

Page 8: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 8

Early Observations• With min. pkt size under high background load, ECN traffic can

lead to instability

• A hotspot relay destination receiving permanent short-packet traffic may be overloaded with F/BECN signaling

• BECN and ACK packets can be caught in reverse congestion

• Reaction delay improvable by direct BCN vs. the CCA mechanism

• CCA operation is sensitive to parameters Oscillations observed... seem to be controllable

• Parameter settings strongly depend on fabric and traffic!

Page 9: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 9

Simulation Model

• Switching network4x4..12x12 input-buffered CIOQ switchesCredit-based flow controlMultistage bidirectional fat-tree networkSource routing: Static or dynamic (shortest path)

• Edge adapter (HCA) operationQueue pairs, credit-controlled round-robin scheduling with rate ctrl. triggered by CMDestination will ACK (64B pkt) after successful reception of user packets (MTUs)Return of short BECN packets if FECN bit is set

• CM mechanism

Switch action: Output buffer threshold with hysteresis -> FECN bit set in packets leaving switch module and BECN packets returned from destination HCA

SRC HCA action: Rate control according to CCA.

Page 10: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

Zurich Research Laboratory

Congestion Control © 2005 IBM Corporation

Baseline Switch: Xbar-based CIOQ with Input Buffering

sche

dule

r

sche

dule

r

sche

dule

r

In[1]

In[2]

In[n]

Out[1]

Out[2]

Out[n]

. . . . .

. . . . .

. . . . .

. . .

. .

. . .

. .

. . .

. .

Logi

cal O

utpu

t Que

ue

Logi

cal O

utpu

t Que

ue

Logi

cal O

utpu

t Que

ue

Physical Input Buffer

Physical Input Buffer

Physical Input Buffer

Switch ArchitectureIN buffer = 72KB

Output service: RR

Switch element (SE) described in Omnet++ #include <omnetpp.h>#include <packetdefinition.h>

class Switch : public cSimpleModule{Module_Class_Members( Switch, cSimpleModule, 0 )< some parameter and variable definitions …….. >virtual void initialize();virtual void handleMessage( cMessage *msg );virtual void finish();

}

Define_Module( Switch );

void Switch::initialize(){

< some initializations, allocation of queue arrays, schedule first SCHEDULE_EVENT, …….. >

void Switch::handleMessage(cMessage *msg){switch( msg->kind() ){case PACKET: < en-queue received packet …….. >case SCHEDULE_EVENT: < de-queue and send packet, return credit,

update scheduler pointer, schedule nextSCHEDULE_EVENT, ….. >

case CREDIT: < return credit to credit pool …….. >}

}

void Switch::finish(){}

3. performed after simulation run

Typical 3-method structure

1. performed before simulation run

2. handling of received messages during simulation run

Page 11: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 11

Baseline Fabric: MIN Topology => Bidir Fat Tree

• 2-level / 3-stage bidir MIN

• Simulated: 8 – 32 nodes

• Time per run: 2-30 min.

shortcut routes within same modules

same modules

port 32

port 4

port 1

port 4

port 1

8 4x4 modules 8 4x4 modules4 8x8 modules

HCATx

HCATx

HCARx

HCARxport 32

shortcut routes within same modules

32 4x4 modules

same modules

port 128

port 4

port 1

port 4

port 1

32 4x4 modules 32 4x4 modules32 4x4 modules

16 8x8 modules

HCATx

HCATx

HCARx

HCARxport 128

• 3-level / 5-stage bidir MIN

• Simulated: 128 – 432 nodes

• Time per run: 20-500 min.

Page 12: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 12

Traffic: ZRL Congestion Benchmark• Source nodes generate one or more hotspots according to matrix

[λij_hot]: tp->q = αk_hot [λij_hot]:tp->q , [λij_hot] is specified per each case below

1. Congestion type: IN- or OUTput-generated

2. Hotspot severity: HSV = λaggr / μHS , λaggr = ∑ λi at hotspotted output, μHS = service rate of the HS

Mild HSV ~ 1 .. 2 Moderate HSV = 3 .. 7Severe HSV >> 10.

3. Hotspot degree: HSD is the fan-in of congestive tree at the measured hotspot

Small’ HSD <10% (of all sources inject hot traffic)Medium HSD 20..60%Large HSD >90%.

Page 13: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 13

Sensitivity Analysis: Combinatorial Tree

Via analysis and validated by simulation, rank CCA params by sensitivity:1. ipd_indx_incr2. rec_time3. ipd_max

Simulation subtree for a 128-node IB network

background load

0.5

0.9

IPD recovery timemin

max

10.5us

.

.

.

.

.

.

IPD index step

min

max

20

.

.

.

.

.

.

maximum IPD

min

max

84us

.

.

.

.

.

.

hotspot degree

(# hot flows)

3 32

3/16x16

5/8x8

# stages / SE size

.

.

.

.

.

.

.

.

.

• 1 sim. point: ~ 1hr• Exhaustive coverage: ~ 78 yrs.

output: ~ 23 PB of data...not feasible, therefore...

• Prune the tree by analytical pre-selection

Network: 32-node 3-stage/128-node 5-stage

Traffic:single HSV = moderate congestiontwo HSD = 3 and 32 flows ca. 800 cases

ca 800 cases were analyzed

Page 14: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 14

Sample Results: “Input-generated” CongestionSelection: 4 ‘corner’ cases

Offered load• pkt size = 2KB (all user traffic MTU)• negative-exponential interarrival times• background traffic destinations uniformly distributed, load:

a) 50%b) 90%

2 congestion patterns, both with HSV ~ 3• Small HSD:

During hotspot period, 3 sources send 100% of their load to DST #31• Large HSD:

During the hotspot all 32 sources send 10% of their load to DST #31

Page 15: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 15

Four corner cases: Un-tuned CM ~ No CM

large hotspot, 50% load

small hotspot, 50% load

large hotspot, 90% load

small hotspot, 90% load

0

0.2

0.4

0.6

0.8

1

1.2

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

output[31]output[30]output[29]output[28]output[27]output[26]output[25]output[24]output[23]output[22]output[21]output[20]output[19]output[18]output[17]output[16]output[15]output[14]output[13]output[12]output[11]output[10]output[9]output[8]output[7]output[6]output[5]output[4]output[3]output[2]output[1]output[0]

hotspot

0

0.2

0.4

0.6

0.8

1

1.2

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

output[31]output[30]output[29]output[28]output[27]output[26]output[25]output[24]output[23]output[22]output[21]output[20]output[19]output[18]output[17]output[16]output[15]output[14]output[13]output[12]output[11]output[10]output[9]output[8]output[7]output[6]output[5]output[4]output[3]output[2]output[1]output[0]

hotspot

0

0.2

0.4

0.6

0.8

1

1.2

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

output[31]output[30]output[29]output[28]output[27]output[26]output[25]output[24]output[23]output[22]output[21]output[20]output[19]output[18]output[17]output[16]output[15]output[14]output[13]output[12]output[11]output[10]output[9]output[8]output[7]output[6]output[5]output[4]output[3]output[2]output[1]output[0]

hotspot

0

0.2

0.4

0.6

0.8

1

1.2

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

output[31]output[30]output[29]output[28]output[27]output[26]output[25]output[24]output[23]output[22]output[21]output[20]output[19]output[18]output[17]output[16]output[15]output[14]output[13]output[12]output[11]output[10]output[9]output[8]output[7]output[6]output[5]output[4]output[3]output[2]output[1]output[0]

hotspot

Page 16: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

Zurich Research Laboratory

Congestion Control © 2005 IBM Corporation

With marginal settings of ipd_idx_incr and rec_time

0

.5

1

0 2ms 4ms 6ms 8ms 10ms

Thro

ughp

ut (

for o

utpu

ts 0

..31)

Time

start congestion end0

.5

1

0 2ms 4ms 6ms 8ms 10msTh

roug

hput

(fo

r out

puts

0..3

1)Time

start congestion end0

.5

1

0 2ms 4ms 6ms 8ms 10ms

Thro

ughp

ut (

for o

utpu

ts 0

..31)

Time

start congestion end0

.5

1

0 2ms 4ms 6ms 8ms 10ms

Thro

ughp

ut (

for o

utpu

ts 0

..31)

Time

start congestion end

Corner case: 4

Load: High (0.9)Hotspot degree: Large (32)

Out-of-range parametersIPD index step too low (=2) IPD index step too high (=40) IPD recovery timer too fast (=2.6us) IPD recovery timer too slow (=84us)

32x32 3-stage (2-level) fat-tree Common parameters: Packet size: 2 KB Hotspot severity: 300%

oscillations breakdown hot flow suffering

Throughput: with Congestion Control:

oscillations

Focus: 4th corner case

Page 17: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 17

Tuning for IG congestion in 32 to 128-node fabrics

32-node• sw_th = 90% of switch input buffer size• max_ipd = 1000 ( = 21µs )

from fairness under worst case => IPD sufficient for all other N-1 sources flows to access a share (1 pkt) of the HS bandwidth

• IPD table index increment ipd_idx_incr = 5 (sweetspot)• IPD recovery time rec_time = 500 ( = 10.5µs )

128-node• max_ipd and ipd_idx_incr ~ 4x larger than above

linear scalability confirmed by simulations

Page 18: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 18

Output-generated congestion with CM tuned for IG

ipd

50% load

ipd

90% load

0

0.2

0.4

0.6

0.8

1

1.2

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

output[31]output[30]output[29]output[28]output[27]output[26]output[25]output[24]output[23]output[22]output[21]output[20]output[19]output[18]output[17]output[16]output[15]output[14]output[13]output[12]output[11]output[10]output[9]output[8]output[7]output[6]output[5]output[4]output[3]output[2]output[1]output[0]

hotspot

0

200

400

600

800

1000

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

IPD[31] IPD[30] IPD[29] IPD[28] IPD[27] IPD[26] IPD[25] IPD[24] IPD[23] IPD[22] IPD[21] IPD[20] IPD[19] IPD[18] IPD[17] IPD[16] IPD[15] IPD[14] IPD[13] IPD[12] IPD[11] IPD[10] IPD[9] IPD[8] IPD[7] IPD[6] IPD[5] IPD[4] IPD[3] IPD[2] IPD[1] IPD[0]

0

0.2

0.4

0.6

0.8

1

1.2

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

output[31]output[30]output[29]output[28]output[27]output[26]output[25]output[24]output[23]output[22]output[21]output[20]output[19]output[18]output[17]output[16]output[15]output[14]output[13]output[12]output[11]output[10]output[9]output[8]output[7]output[6]output[5]output[4]output[3]output[2]output[1]output[0]

hotspot

0

200

400

600

800

1000

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

IPD[31] IPD[30] IPD[29] IPD[28] IPD[27] IPD[26] IPD[25] IPD[24] IPD[23] IPD[22] IPD[21] IPD[20] IPD[19] IPD[18] IPD[17] IPD[16] IPD[15] IPD[14] IPD[13] IPD[12] IPD[11] IPD[10] IPD[9] IPD[8] IPD[7] IPD[6] IPD[5] IPD[4] IPD[3] IPD[2] IPD[1] IPD[0]

Page 19: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 19

OG w/ “Big Bang” control50% load 90% load

0

0.2

0.4

0.6

0.8

1

1.2

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

output[31]output[30]output[29]output[28]output[27]output[26]output[25]output[24]output[23]output[22]output[21]output[20]output[19]output[18]output[17]output[16]output[15]output[14]output[13]output[12]output[11]output[10]output[9]output[8]output[7]output[6]output[5]output[4]output[3]output[2]output[1]output[0]

hotspot

0

0.2

0.4

0.6

0.8

1

1.2

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

output[31]output[30]output[29]output[28]output[27]output[26]output[25]output[24]output[23]output[22]output[21]output[20]output[19]output[18]output[17]output[16]output[15]output[14]output[13]output[12]output[11]output[10]output[9]output[8]output[7]output[6]output[5]output[4]output[3]output[2]output[1]output[0]

hotspot

Parameters for OG:1. ipd table filing: multiplicative (vs. linear in the IG)2. max_ipd = 10000 ( = 210us )3. recovery time: rec_time = 10000 ( = 210us )4. ipd increase: ipd_idx_incr[i] += 127• ipd decrease: ipd_idx_incr[i] -= 1

Page 20: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 20

OG w/ “Big Bang” control and enhancements

ipd

90% load, with false BECN filtering

ipd

90% load, 3x switch memory

0

0.2

0.4

0.6

0.8

1

1.2

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

output[31]output[30]output[29]output[28]output[27]output[26]output[25]output[24]output[23]output[22]output[21]output[20]output[19]output[18]output[17]output[16]output[15]output[14]output[13]output[12]output[11]output[10]output[9]output[8]output[7]output[6]output[5]output[4]output[3]output[2]output[1]output[0]

hotspot

0

0.2

0.4

0.6

0.8

1

1.2

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

output[31]output[30]output[29]output[28]output[27]output[26]output[25]output[24]output[23]output[22]output[21]output[20]output[19]output[18]output[17]output[16]output[15]output[14]output[13]output[12]output[11]output[10]output[9]output[8]output[7]output[6]output[5]output[4]output[3]output[2]output[1]output[0]

hotspot

• With above enhancements, different (from IG congestion) control algorithm, and, different table values,

• a (marginally) stable solution to OG congestion is feasible.

Page 21: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 21

Tuning Guidelines for IG CongestionN = network size, i.e., number of portsHSD = maximum hot spot degree

Worst case = N, but can be less if system partitioning known.Absolute time (µsec) values are w/ reference to our model’s RTTof 2-20 µsec. (w/o congestion)

Item Value

IPD Table size 128

Switch threshold for congestion detection 90% of queue capacity, with hysteresis

IPD table index increment Min(1/6•N, 1/2•HSD)

Max IPD value 2/3 HSD µsec.

Recovery timer 10 µsec.

Page 22: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 22

Limitations• Stability limits...?

Large hotspot degree (128 sources send 5% to hotspot) with 50% background load has no solution in the param space

• Input- and Output-generate congestions require two distinct control loops (table values and rate control algorithms)

Corollary: Any mixture of IG and OG must be controlled by the dominant case, although the solution might not be stable.

• Why? • Under-sampling

Generally not enough true BECNs for convergence Next problem: false BECNs

» That’s all.

Page 23: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

Contributors

N. Ni†, G. Pfister†, C. Minkenberg*, T. Engbersen*, D. Craddock§, W. Rooney§, R. Luijten* and T. Schlipf‡

* IBM Research GmbH, Rüschlikon, Switzerland; § IBM Systems and Technology Group, Poughkeepsie, NY USA;

† IBM Systems and Technology Group, Austin, TX USA; ‡ IBM Systems and Technology Group Boeblingen, Germany.

Contact: [email protected]