1CIST560 by M. Hamdi
Packet Scheduling/Arbitration in Virtual Output Queues:
Maximal Matching Algorithms
(Part II)
Pointer Desynchronization
• Performance: RRM < iSlip < FIRM
• Difference only in updating pointers
• Observation: iSlip and FIRM can effectively desynchronize their output pointers
• Pointer desynchronization works best when it is forced rather than left to emerge on its own
Static Round Robin Matching (SRR): To Achieve FULL Desynchronization
• Initialization. The input pointers are set to 0. The output pointers are set to an initial pattern in which no two pointers have the same value.
• The 3 steps of one iteration are:
– Request. Each input sends a request to every output for which it has a queued cell.
– Grant. If an output receives any requests, it chooses the one that appears next in a fixed round-robin schedule starting from the highest priority element. The output notifies each input whether or not its request was granted. The pointer to the highest priority element of the round-robin schedule is always incremented by one (modulo N), whether there is a grant or not.
SRR (Cont’d)
– Accept. If an input receives any grants, it accepts the one that appears next in a fixed round-robin schedule starting from the highest priority element. The pointer to the highest priority element of the round-robin schedule is incremented (modulo N) to one location beyond the accepted one.
• In DSRR (an improved version of SRR), the input pointers are also desynchronized.
• Rotating DSRR (RDSRR):
– Unfairness among inputs under a special traffic model.
– Outputs search in clockwise and anti-clockwise directions alternately to decide grants.
[Figure: traffic matrix of the special traffic model, with entries x and 0]
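The SRR steps described above can be sketched as a single request-grant-accept iteration. This is a minimal illustration, not the author's implementation: the function name `srr_iteration`, the boolean `voq` occupancy matrix, and the choice of initial output pointers (output j starts at j) are assumptions made for the example.

```python
def srr_iteration(voq, in_ptr, out_ptr):
    """One SRR iteration (sketch). voq[i][j] is True when input i has a
    cell queued for output j. Pointers are updated in place."""
    n = len(voq)
    grants = {}  # input -> outputs that granted it
    for j in range(n):
        # Grant: first requesting input at or after the output pointer.
        for k in range(n):
            i = (out_ptr[j] + k) % n
            if voq[i][j]:
                grants.setdefault(i, []).append(j)
                break
        # SRR rule: the output pointer always advances by one, grant or not.
        out_ptr[j] = (out_ptr[j] + 1) % n
    matches = []
    for i, offered in grants.items():
        # Accept: first granting output at or after the input pointer.
        j = min(offered, key=lambda out: (out - in_ptr[i]) % n)
        matches.append((i, j))
        in_ptr[i] = (j + 1) % n  # one location beyond the accepted one
    return matches


# Fully loaded 4x4 switch with desynchronized output pointers (assumed pattern).
voq = [[True] * 4 for _ in range(4)]
in_ptr, out_ptr = [0, 0, 0, 0], [0, 1, 2, 3]
first = srr_iteration(voq, in_ptr, out_ptr)
```

With all VOQs occupied and no two output pointers equal, every output grants a different input, so a single iteration already yields a full match.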
Simulation Results
[Figure: relative average delay vs. normalized load, 32x32 switch under uniform traffic; curves: iSlip, FIRM, SRR, DSRR, RDSRR]
Simulation Results
[Figure: relative average delay vs. normalized load, 32x32 switch under uniform bursty traffic; curves: iSlip, FIRM, SRR, DSRR, RDSRR]
Simulation Results
[Figure: relative average delay (log scale) vs. normalized load, 32x32 switch under hotspot traffic; curves: iSlip, FIRM, SRR, DSRR, RDSRR]
Simulation Results
[Figure: average delay (log scale) vs. normalized load, 32x32 switch under unbalanced traffic; curves: iSlip, FIRM, SRR, DSRR, RDSRR]
Stability Property
• A VOQ switch is considered stable if it approaches a steady state where the expected length of each VOQ is bounded. If it is stable, 100% throughput can be achieved under any admissible traffic pattern.
• RDSRR is more stable than iSlip and FIRM under various traffic patterns.
Stability Property (Cont’d)
[Figure: throughput vs. normalized load (0.8 to 1), 32x32 switch under unbalanced traffic; curves: iSlip, FIRM, RDSRR, Output]
3-Phase & 2-Phase Algorithms
• iSlip & FIRM are 3-phase algorithms: Request-Grant-Accept
• DRRM is a 2-phase algorithm: Grant-Accept
– Each input sends one grant
– Each output sends one accept
• 2-FIRM is the 2-phase version of FIRM
3-Phase & 2-Phase Algorithms
[Figure: relative average delay vs. normalized load, 32x32 switch under uniform traffic; curves: iSlip, DRRM, FIRM, 2-FIRM]
3-Phase & 2-Phase Algorithms
[Figure: relative average delay (log scale) vs. normalized load, 32x32 switch under hotspot traffic; curves: iSlip, DRRM, FIRM, 2-FIRM]
3-Phase & 2-Phase Algorithms
• In the general case, the traffic model changes from time to time
• When the temporary non-uniformity is on the input side, the 3-phase scheme performs better
• When the temporary non-uniformity is on the output side, the 2-phase scheme performs better
2-stage Maximum Size Matching Algorithm: Description
• The 2-stage algorithm works in the following way:
1. The pointers at both input and output sides are kept fully desynchronized.
2. In each iteration, there are 3 steps:
Step 1: Each input sends a request to every output for which it has a queued cell.
Step 2: Each input selects one VOQ, the one that appears next starting from its highest priority output, and sends a grant to that output. Each output selects one of the requests received in step 1, the one that appears next starting from its highest priority input, and sends a grant to that input. OutputCount = number of outputs receiving grants from inputs. InputCount = number of inputs receiving grants from outputs.
2-stage Maximum Size Matching Algorithm: Description
• Step 3: If OutputCount ≥ InputCount, each output selects one among the grants received in step 2, the one that appears next starting from its highest priority input, and sends an accept.
Else, each input selects one among the grants received in step 2, the one that appears next starting from its highest priority output, and sends an accept.
• In simple words, the algorithm decides in each time slot whether to use the 2-phase or the 3-phase scheme, based on which one makes more matches.
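The three steps of the 2-stage algorithm can be sketched as follows. This is an illustration under stated assumptions: the names `rr_pick` and `two_stage_match` are hypothetical, ties between the two counts are resolved in favour of the 2-phase result (the comparison symbol in the source is not legible), and pointer updates are left out because the slides do not specify them.

```python
def rr_pick(candidates, ptr, n):
    """Candidate that appears first in round-robin order starting at ptr."""
    return min(candidates, key=lambda c: (c - ptr) % n)


def two_stage_match(voq, in_ptr, out_ptr):
    """voq[i][j] is True when input i has a cell for output j (sketch)."""
    n = len(voq)
    # Step 2a: each input grants one of its non-empty VOQs (2-phase side).
    grants_to_output = {}
    for i in range(n):
        outs = [j for j in range(n) if voq[i][j]]
        if outs:
            grants_to_output.setdefault(rr_pick(outs, in_ptr[i], n), []).append(i)
    # Step 2b: each output grants one of the requesting inputs (3-phase side).
    grants_to_input = {}
    for j in range(n):
        ins = [i for i in range(n) if voq[i][j]]
        if ins:
            grants_to_input.setdefault(rr_pick(ins, out_ptr[j], n), []).append(j)
    output_count = len(grants_to_output)  # outputs holding a grant from an input
    input_count = len(grants_to_input)    # inputs holding a grant from an output
    # Step 3: the side with the larger count sends the accepts.
    if output_count >= input_count:  # assumed tie-break
        return sorted((rr_pick(ins, out_ptr[j], n), j)
                      for j, ins in grants_to_output.items())
    return sorted((i, rr_pick(outs, in_ptr[i], n))
                  for i, outs in grants_to_input.items())
```

Each count is exactly the number of matches the corresponding scheme would produce in this time slot, which is why comparing them selects the larger match.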
2-stage Maximum Size Matching Algorithm: Hardware Implementation
[Figure: block diagram. The state of the input queues (N² bits) feeds grant arbiters 1..N and accept arbiters 1..N, together with a decision register, an output counter, an input counter, and a comparator. Legend: 1st group of inputs, 2nd group of inputs, 2 physical lines from comparator]
Performance Evaluation: Simulation Study (Uniform Traffic)
[Figure: relative average delay vs. normalized load, 32x32 switch under uniform traffic (1 iteration); curves: iSlip, FIRM, 2-Stage, SRR, Output]
Performance Evaluation: Simulation Study
Load                                  0.5    0.6    0.7    0.8    0.9    0.95   0.99
2-stage over iSlip
  Improvement Percentage              67%    196%   81%    58%    60%    84%    43%
  Normalized Improvement Percentage   40%    66%    45%    37%    37%    46%    30%
  Improvement Factor                  1.67   2.96   1.81   1.58   1.60   1.84   1.43
SRR over iSlip
  Improvement Percentage              7%     75%    92%    54%    59%    83%    43%
  Normalized Improvement Percentage   7%     43%    48%    35%    37%    45%    30%
  Improvement Factor                  1.07   1.75   1.92   1.54   1.59   1.83   1.43
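The slides do not define the three metrics in these tables. The tabulated values are consistent with the relations below, which are therefore an inference from the numbers, not the author's stated definitions: improvement percentage = (factor - 1) x 100, and normalized improvement percentage = (1 - 1/factor) x 100. The helper name `improvement_metrics` is hypothetical.

```python
def improvement_metrics(factor):
    """Derive both percentages from an improvement factor (inferred relations)."""
    percentage = (factor - 1.0) * 100.0
    normalized = (1.0 - 1.0 / factor) * 100.0
    return round(percentage), round(normalized)


# Factors taken from the "2-stage over iSlip" rows of the uniform-traffic table.
for factor in (1.67, 2.96, 1.81, 1.43):
    print(factor, improvement_metrics(factor))
```

Under this reading the normalized figure saturates at 100% as the factor grows, which matches the hotspot table, where factors in the hundreds are listed with a normalized improvement of 99% to 100%.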
Performance Evaluation: Simulation Study (Bursty Traffic)
[Figure: relative average delay vs. normalized load, 32x32 switch under uniform bursty traffic (1 iteration); curves: iSlip, FIRM, 2-Stage, SRR, Output]
Performance Evaluation: Simulation Study
Load                                  0.63   0.7    0.75   0.8    0.85   0.9
2-stage over iSlip
  Improvement Percentage              213%   96%    70%    46%    28%    16%
  Normalized Improvement Percentage   68%    49%    41%    31%    22%    14%
  Improvement Factor                  3.13   1.96   1.70   1.46   1.28   1.16
SRR over iSlip
  Improvement Percentage              89%    56%    46%    33%    22%    14%
  Normalized Improvement Percentage   47%    36%    32%    25%    18%    12%
  Improvement Factor                  1.89   1.56   1.46   1.33   1.22   1.14
Performance Evaluation: Simulation Study (Hotspot Traffic)
[Figure: relative average delay (log scale) vs. normalized load, 32x32 switch under hotspot traffic (1 iteration); curves: iSlip, FIRM, 2-Stage, SRR, Output]
Performance Evaluation: Simulation Study
Load                                  0.31   0.38   0.43       0.46       0.50
2-stage over iSlip
  Improvement Percentage              26%    56%    101626%    160469%    81633%
  Normalized Improvement Percentage   21%    36%    100%       100%       100%
  Improvement Factor                  1.26   1.56   1017.26    1605.69    817.33
SRR over iSlip
  Improvement Percentage              5%     9%     56177%     74631%     19618%
  Normalized Improvement Percentage   5%     8%     99%        100%       99%
  Improvement Factor                  1.05   1.09   562.77     747.31     197.18
Performance Evaluation: Simulation Study (Unbalanced Traffic)
[Figure: relative average delay (log scale) vs. normalized load, 32x32 switch under cross-shaped traffic (1 iteration); curves: iSlip, FIRM, 2-Stage, SRR, Output]
Performance Evaluation: Simulation Study
Load                                  0.5    0.6    0.7    0.8    0.9    0.95     0.99
2-stage over iSlip
  Improvement Percentage              12%    39%    53%    142%   552%   8040%    3351%
  Normalized Improvement Percentage   11%    28%    35%    59%    85%    99%      97%
  Improvement Factor                  1.12   1.39   1.53   2.42   6.52   81.40    34.51
SRR over iSlip
  Improvement Percentage              4%     35%    74%    225%   843%   11494%   3499%
  Normalized Improvement Percentage   4%     26%    43%    69%    89%    99%      97%
  Improvement Factor                  1.04   1.35   1.74   3.25   9.43   115.94   35.99
A new algorithm – RDESRR
• Real Desynchronized Round Robin Model (RDESRR)
• Based on the 2-phase RRM model (Request and Grant)
• Adds a small shared memory that every output can read and write (called Share Bits)
• The size of the memory is 1 bit per input
• If the bit is set, the corresponding input has already been granted by an output
• If the bit is not set, the output may grant the corresponding input port
RDESRR Conceptual model
[Figure: 4x4 switch with inputs 0-3 and outputs 0-3, a round-robin pointer over inputs 0-3 at each output, and the Share Bits (one per input, 0-3)]
RDESRR model
• 2 phases only
• Request. Each input sends a request to every output for which it has a queued cell.
• Grant. If an output receives any requests, it chooses the one that appears next in a fixed round-robin schedule starting from the highest priority element. The output checks whether the corresponding share bit is set. If it is not set, the output sets the bit and notifies the input that its request was granted. Otherwise, the output moves on to the next request, until all requests have been examined. The pointer gi to the highest priority element of the round-robin schedule is incremented (modulo N) to one location beyond the granted input. If no request is received, the pointer stays unchanged.
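The grant phase with share bits can be sketched as below. This is a minimal illustration, not the author's code: the function name `rdesrr_grant` and the `requests` layout are hypothetical, and the order in which outputs reach the shared memory (here, ascending output index) is an assumption, since the slides only state that the share bits are first come, first served.

```python
def rdesrr_grant(requests, grant_ptr):
    """requests[j] is the set of inputs requesting output j.
    Returns {output: granted input} and updates grant_ptr in place."""
    n = len(requests)
    share_bits = [False] * n  # one bit per input, fresh (reset) every time slot
    grants = {}
    for j in range(n):  # assumed access order to the shared memory
        for k in range(n):
            i = (grant_ptr[j] + k) % n
            # Skip inputs that did not request or were already granted elsewhere.
            if i in requests[j] and not share_bits[i]:
                share_bits[i] = True
                grants[j] = i
                grant_ptr[j] = (i + 1) % n  # one location beyond the granted input
                break
    return grants
```

Because an input can be granted by at most one output, no accept phase is needed, and even fully synchronized pointers end up desynchronized after one heavily loaded slot.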
RDESRR Demo – Add a share memory in Output
Step 2: Grant
[Figure: 4x4 switch with a round-robin pointer at each output and the Share Bits, one per input]
• Add a small shared memory that every output can read and write (called Share Bits)
RDESRR Demo – Output checks the share bits
Step 2: Grant
[Figure: 4x4 switch with a round-robin pointer at each output and the Share Bits, one per input]
• The output checks whether the corresponding bit is set
RDESRR Demo – When share bit is occupied
Step 2: Grant
[Figure: 4x4 switch with a round-robin pointer at each output and the Share Bits, one per input]
• If the bit is not set, the output sets the bit and notifies the input that its request was granted
• The share bits are allocated First Come First Served
RDESRR Demo – Output looks for next request
Step 2: Grant
[Figure: 4x4 switch with a round-robin pointer at each output and the Share Bits, one per input]
• If the bit is set, the output looks for the next request until all requests have been examined
RDESRR Demo – All share bits are allocated
Step 2: Grant
[Figure: 4x4 switch with a round-robin pointer at each output and the Share Bits, one per input]
• Once every share bit is allocated, every requesting input has been granted
RDESRR Demo – Pointer update/Share bit reset
[Figure: 4x4 switch with a round-robin pointer at each output and the Share Bits, one per input]
• The pointer gi to the highest priority element of the round-robin schedule is incremented (modulo N) to one location beyond the granted input
• If no request is received, the pointer stays unchanged
• Share bits are also reset
SIM Results
• Test run for a 32x32 switch in SIM using –l 1000000
RDESRR:
Load   Total Latency   Avg Match Size
0.1    0.0588          3.1958
0.2    0.1447          6.3938
0.3    0.2686          9.5947
0.4    0.4501          12.7940
0.5    0.7198          15.9960
0.6    1.1398          19.1980
0.7    1.8636          22.3961
0.8    3.2619          25.5986
0.9    7.5087          28.8003
1.0    715.5900        31.9850
Input Queueing: Longest Queue First or Oldest Cell First
[Figure: 4x4 bipartite graph between inputs 1-4 and outputs 1-4 with edge weights of 1 and 10; the maximum weight matching is selected]
Weight = { Queue Length (LQF), Waiting Time (OCF) } ⇒ 100% throughput
Input Queueing
Why is serving long/old queues better than serving the maximum number of queues?
• When traffic is uniformly distributed, servicing the maximum number of queues leads to 100% throughput.
• When traffic is non-uniform, some queues become longer than others.
• A good algorithm keeps the queue lengths matched, and services a large number of queues.
[Figure: average occupancy per VOQ, shown for uniform traffic and for non-uniform traffic]
Maximum/Maximal Weight Matching
• 100% throughput for admissible traffic (uniform or non-uniform)
• Maximum Weight Matching
– OCF (Oldest Cell First): w = cell waiting time
– LQF (Longest Queue First): w = input queue occupancy
– LPF (Longest Port First): w = QL of the source port + sum of QL from the source port to the destination port
• Maximal Weight Matching (practical algorithms)
– iOCF
– iLQF
– iLPF (comparators in the critical path of iLQF are removed)
Maximal Weight Matching Algorithms: iLQF
• Request. Each unmatched input sends a request word of width bits to each output for which it has a queued cell, indicating the number of cells that it has queued to that output.
• Grant. If an unmatched output receives any requests, it chooses the largest valued request. Ties are broken randomly.
• Accept. If an unmatched input receives one or more grants, it accepts the one to which it made the largest valued request. Ties are broken randomly.
Maximal Weight Matching Algorithms: iLQF
• The i-LQF algorithm has the following properties:
• Property 1. Independent of the number of iterations, the longest input queue is always served.
• Property 2. As with i-SLIP, the algorithm converges in at most logN iterations.
• Property 3. For an inadmissible offered load, an input queue may be starved.
Maximal Weight Matching Algorithms: iOCF
• The i-OCF algorithm works in similar fashion to iLQF, and has the following properties:
• Property 1. Independent of the number of iterations, the cell that has been waiting the longest time in the input queues (it must be at the head of its queue) is always served.
• Property 2. As with i-LQF, the algorithm converges in at most logN iterations.
• Property 3. No input queue can be starved indefinitely.
• Property 4. It is difficult to keep time stamps on the cells.
Other research efforts
• Packet-based arbitration
• Exhaustive-based arbitration
• Numerous other efforts
Packet Scheduling/Arbitration in Virtual Output Queues: Randomized Algorithms and Others
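Input-Queued Packet Switch
[Figure: N x N crossbar with scheduler, inputs 1..N and outputs 1..N. $X_{i,j}$ is the occupancy of the queue at input i for output j; $\lambda_{i,j}$ (from $\lambda_{1,1}$ to $\lambda_{N,N}$) is the arrival rate from input i to output j]
Admissible traffic: $\sum_i \lambda_{i,j} < 1$ and $\sum_j \lambda_{i,j} < 1$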
Stability of Scheduling
Definition:
Let Xi,j(t) be the number of packets queued at input i for output j at time-slot t.
Then an algorithm is stable iff:
$$E\Big[\sum_{i,j} X_{i,j}(t)\Big] < \infty$$
Maximum Matching in VOQ Architecture
[Figure: 4x4 bipartite request graph between inputs 1-4 and outputs 1-4 with weighted edges, shown once with a maximum size matching and once with a maximum weight matching]
Complexity of Maximum Matchings
• Maximum Size/Cardinality Matchings:
– It is not a stable algorithm
– Algorithm by Dinic: $O(N^{5/2})$
• Maximum Weight Matchings
– Algorithm by Kuhn: $O(N^3 \log N)$
– It is a stable algorithm
• In general:
– Hard to implement in hardware (the obstacle is that it does not lend itself to a simple hardware implementation, not its serial time complexity)
– Slooooow.
Maximal Matching Algorithms
• Maximal matching algorithms are heuristic algorithms that try to approximate MSM or MWM.
• In general, maximal matching is much simpler to implement (not because of its time complexity) and has a much faster running time.
• A maximal size matching is at least half the size of a maximum size matching.
• A maximal weight matching is at least half the weight of a maximum weight matching.
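The "at least half" bound for size matchings can be checked on a small request graph. The sketch below is illustrative only: `greedy_maximal` builds a maximal matching by scanning edges in the given order, and `maximum_size` finds the maximum by brute force over all permutations, which is only practical for tiny N. Both names are hypothetical.

```python
from itertools import permutations


def greedy_maximal(edges):
    """Add edges in order while both endpoints are free: a maximal matching."""
    used_in, used_out, match = set(), set(), []
    for i, j in edges:
        if i not in used_in and j not in used_out:
            match.append((i, j))
            used_in.add(i)
            used_out.add(j)
    return match


def maximum_size(edges, n):
    """Size of a maximum matching, by exhaustive search (tiny n only)."""
    edge_set = set(edges)
    return max(sum((i, p[i]) in edge_set for i in range(n))
               for p in permutations(range(n)))


# Input 0 can reach outputs 0 and 1; input 1 can reach output 1 only.
edges = [(0, 1), (0, 0), (1, 1)]
maximal = greedy_maximal(edges)
best = maximum_size(edges, 2)
```

Here the greedy scan takes edge (0, 1) first and blocks everything else, so the maximal matching has size 1 while the maximum has size 2, which is exactly the worst case the bound allows.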
Maximal Size Matching Algorithm: Performance and Properties
• Can have 100% throughput under uniform traffic
• They converge in log N iterations to a maximal size matching
• Their performance can be quite good (close to an ideal Output Queued Switch) with multiple iterations
• The best iterative maximal size matching algorithm takes $O(N^2 \log N)$ serial or $O(\log N)$ parallel time steps.
• If the number of iterations is constant, then it can be implemented in constant time (that is why it is practical).
Implementation of the parallel maximal matching algorithms
[Figure: block diagram. The state of the input queues (N² bits) feeds grant arbiters 1..N and request arbiters 1..N, whose result is stored in a decision register]
Small Differences (in implementation) between RRM, iSlip & FIRM
But large difference in performance
Input pointer
  No grant:            unchanged (RRM, iSlip, FIRM)
  Granted:             one location beyond the accepted one (RRM, iSlip, FIRM)
Output pointer
  No request:          unchanged (RRM, iSlip, FIRM)
  Grant accepted:      one location beyond the granted one (RRM, iSlip, FIRM)
  Grant not accepted:  RRM: one location beyond the previously granted one; iSlip: unchanged; FIRM: the granted one
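The output-pointer rules in the table differ only in the "grant not accepted" case, which a small function makes explicit. This is a sketch for illustration: the function name, the string labels for the schemes, and the use of `None` for "no request" are assumptions.

```python
def next_output_pointer(scheme, ptr, granted, accepted, n):
    """New output pointer after one iteration, following the table above.

    ptr: current pointer; granted: input that was granted, or None when the
    output received no request; accepted: whether that grant was accepted.
    """
    if granted is None:
        return ptr  # no request: unchanged in all three schemes
    if accepted:
        return (granted + 1) % n  # all three: one beyond the granted input
    if scheme == "RRM":
        return (granted + 1) % n  # moves on even though the grant was wasted
    if scheme == "iSlip":
        return ptr  # stays put until a grant is accepted
    if scheme == "FIRM":
        return granted  # points at the unsuccessful input so it is served next
    raise ValueError(f"unknown scheme: {scheme}")
```

For example, with n = 4, pointer 0, and an unaccepted grant to input 2, the pointer becomes 3 under RRM, stays 0 under iSlip, and becomes 2 under FIRM.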
Maximum/Maximal Weight Matching
• 100% throughput for admissible traffic (uniform or non-uniform)
• Maximum Weight Matching
– OCF (Oldest Cell First): w = cell waiting time
– LQF (Longest Queue First): w = input queue occupancy
– LPF (Longest Port First): w = QL of the source port + sum of QL from the source port to the destination port
• Maximal Weight Matching (practical iterative algorithms)
• Make these maximal weight matching algorithms operate like iSLIP
– iOCF
– iLQF
– iLPF
Maximal Weight Matching Algorithms: iLQF
• Request. Each unmatched input sends a request word of width bits to each output for which it has a queued cell, indicating the number of cells that it has queued to that output.
• Grant. If an unmatched output receives any requests, it chooses the largest valued request (has the longest queue). Ties are broken randomly.
• Accept. If an unmatched input receives one or more grants, it accepts the one to which it made the largest valued request (has the longest queue). Ties are broken randomly.
Maximal Weight Matching Algorithms: iLQF
• The i-LQF algorithm has the following properties:
• Property 1. Independent of the number of iterations, the longest input queue is always served.
• Property 2. As with i-SLIP, the algorithm converges in at most logN iterations.
• Property 3. For an inadmissible offered load, an input queue may be starved.
• Property 4. It is a stable algorithm.
Maximal Weight Matching Algorithms: iOCF
• The i-OCF algorithm works in similar fashion to iLQF, and has the following properties:
• Property 1. Independent of the number of iterations, the cell that has been waiting the longest time in the input queues (it must be at the head of its queue) is always served.
• Property 2. As with i-LQF, the algorithm converges in at most logN iterations.
• Property 3. No input queue can be starved indefinitely.
• Property 4. It is difficult to keep time stamps on the cells.
Motivation
• Networking problems suffer from the “curse of dimensionality”
– algorithmic solutions do not scale well
• Typical causes
– size: large number of users or large number of I/O
– time: very high speeds of operation
• A good deterministic algorithm exists (Max Flow), but …
– it requires too large a data structure
– it needs state information, and “state” is too big
– it “starts from scratch” in each iteration
Randomization
• Randomized algorithms have frequently been used in situations where the state space is very large (e.g., the N! possible matchings between inputs and outputs)
• Randomized algorithms
– are a powerful way of approximating
– it is often possible to randomize deterministic algorithms
– this simplifies the implementation while retaining a (surprisingly) high level of performance
• The main idea is
– to simplify the decision-making process
– by basing decisions upon a small, randomly chosen sample of the state
– rather than upon the complete state
An Illustrative Example
Find the largest element of a set S of size 1 billion
• Deterministic algorithm: linear search
– has a complexity of 1 billion
• The randomized version: find the largest of 10 randomly chosen samples
– has a complexity of 10
– (note: this ignores the complexity of choosing 10 random samples)
• Performance
– linear search will find the absolute largest element
– if R is the element found by the randomized algorithm, we can make statements like
$P(R \text{ is at least the 100 millionth largest element}) = 1 - (1 - 1/10)^{10} \approx 0.65$
– thus, the randomized algorithm performs well with good probability, and that probability rises quickly as more samples are taken
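The probability in this example can be worked out directly. The sketch below assumes the samples are drawn uniformly and independently (with replacement), so that each sample misses the top fraction f of the set with probability 1 - f; the function name is hypothetical.

```python
def prob_in_top_fraction(samples, fraction):
    """P(the largest of `samples` uniform draws lies in the top `fraction`)."""
    # All draws must miss the top fraction for the result to fall outside it.
    return 1.0 - (1.0 - fraction) ** samples


# Top 100 million out of 1 billion is the top 10% of the set.
p10 = prob_in_top_fraction(10, 0.1)
p30 = prob_in_top_fraction(30, 0.1)
print(f"10 samples: {p10:.3f}, 30 samples: {p30:.3f}")
```

Ten samples reach the top 10% roughly two times out of three, and thirty samples do so more than 95% of the time, still at a negligible cost compared with a linear search over a billion elements.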
Randomizing Iterative Schemes (e.g., iSLIP)
• Often, we want to perform some operation iteratively
• Example: find the heaviest matching in a switch in every time slot
• Since, in each time slot
– at most one packet can arrive at each input
– and, at most one packet can depart from each output
• the size of the queues, or the “state” of the switch, doesn’t change by much between successive time slots
• so, a matching that was heavy at time t will quite likely continue to be heavy at time t+1
• This suggests that
– knowing a heavy matching at time t should help in determining a heavy matching at time t+1
– there is no need to start from scratch in each time slot
Summarizing Randomized Algorithms
• Randomized algorithms can help simplify the implementation
– by reducing the amount of work in each iteration
• If the state of the system doesn’t change by much between iterations, then
– we can reduce the work even further by carrying information between iterations
• The big pay-off is that, even though it is an approximation, the performance of a randomized scheme can be surprisingly good
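Randomized Scheduling Algorithms: Example
• Consider a 3 x 3 input-queued switch
– input traffic is Bernoulli IID with λij = α/3 for all i, j, and α < 1:
$$(\lambda_{ij}) = \frac{\alpha}{3}\begin{pmatrix}1&1&1\\1&1&1\\1&1&1\end{pmatrix}$$
– This is admissible
– note: there are a total of 6 (= 3!) possible service matrices:
$$\begin{pmatrix}1&0&0\\0&1&0\\0&0&1\end{pmatrix},\ \begin{pmatrix}0&1&0\\1&0&0\\0&0&1\end{pmatrix},\ \begin{pmatrix}1&0&0\\0&0&1\\0&1&0\end{pmatrix},\ \begin{pmatrix}0&0&1\\1&0&0\\0&1&0\end{pmatrix},\ \begin{pmatrix}0&1&0\\0&0&1\\1&0&0\end{pmatrix},\ \begin{pmatrix}0&0&1\\0&1&0\\1&0&0\end{pmatrix}$$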
Random Scheduling Algorithms
• In time slot n, let S(n) be one of the 6 possible matchings, chosen independently and uniformly at random (u.a.r.)
• Stability of Random
– Consider L11(n), the number of packets in VOQ11
– arrivals to VOQ11 occur according to A11(n), which is Bernoulli IID
– input rate = λ11 = α/3
– this queue gets served whenever the service matrix connects input 1 to output 1
– there are 2 service matrices that connect input 1 to output 1
– since Random chooses service matrices u.a.r., input 1 is connected to output 1 for a fraction of time = 2/6 = 1/3, which is the service rate between input 1 and output 1
– E(L11(n)) < ∞ iff λ11 < 1/3, i.e., iff α < 1
• Hence Random is stable for this (uniform) traffic.
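The 1/3 service rate can be confirmed by enumerating the service matrices. The sketch below represents each matching as a permutation `p`, where `p[i]` is the output connected to input `i`, and uses 0-based port indices; the function name `service_rate` is hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def service_rate(n, i, j):
    """Fraction of the n! matchings that connect input i to output j."""
    perms = list(permutations(range(n)))
    hits = sum(1 for p in perms if p[i] == j)
    return Fraction(hits, len(perms))


rate = service_rate(3, 0, 0)       # input 1 to output 1 in the slide's numbering
alpha = Fraction(9, 10)
stable = alpha / 3 < rate          # uniform traffic: arrival rate alpha/3
```

By symmetry the rate is 1/N for every input-output pair, so under uniform traffic the queue is stable for any α < 1, while a VOQ that receives more than 1/N of the load cannot be served fast enough.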
Random Scheduling Algorithms
• Instability of Random
• Now suppose λii = α for all i and λij = 0 for i ≠ j
– clearly, this is admissible traffic for all α < 1
– but, under Random, the service rate at VOQ11 is 1/3 at best
– hence VOQ11, and the switch, will be unstable as soon as α ≥ 1/3
• Stability (or 100% throughput) means the algorithm is stable under all admissible traffic!
Simulation Scenario
• Switch Size: 32 x 32
• Input Traffic (shown for a 4 x 4 switch)
– diagonal load matrix:
$$\begin{pmatrix} x&0&0&y\\ y&x&0&0\\ 0&y&x&0\\ 0&0&y&x \end{pmatrix}$$
• normalized load = x + y < 1
• x = 2y
• It is a good test case
Obvious Randomized Schemes
• Choose a matching at random and use it as the schedule: this doesn’t give 100% throughput (already shown)
• Choose 2 matchings at random and use the heavier one as the schedule
• Choose N matchings at random and use the heaviest one as the schedule
None of these can give 100% throughput!
[Figure: mean IQ length (log scale) vs. normalized load under diagonal traffic; curves: MWM, R32, R1]
Iterative Randomized Scheme (Tassiulas)
• Say M is the matching used at time t
• Let R be a new matching chosen uniformly at random (u.a.r.) among the N! different matchings
• At time t+1, use the heavier of M and R
• Complexity is very low: O(1) iterations
• This gives 100% throughput!
– note: the boost in throughput is due to memory (saving previous matchings)
• But, delays are very large
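One time slot of this scheme can be sketched in a few lines. The example is an illustration under assumptions: matchings are full permutations (`m[i]` is the output matched to input `i`), the weight of an edge is the queue length `q[i][j]`, and the names `weight` and `tassiulas_step` are hypothetical.

```python
import random


def weight(matching, q):
    """Total queue length served by a matching given as a permutation."""
    return sum(q[i][j] for i, j in enumerate(matching))


def tassiulas_step(prev, q, rng=random):
    """Keep the previous matching unless a fresh random one is heavier."""
    candidate = list(range(len(q)))
    rng.shuffle(candidate)  # uniform over the N! matchings
    return candidate if weight(candidate, q) > weight(prev, q) else prev


# Diagonal queues: the identity matching is the heaviest and is never dropped.
q = [[5, 0], [0, 5]]
schedule = tassiulas_step([0, 1], q)
```

The schedule's weight can only fall because the queues themselves change between slots, never because of the random draw, which is the "memory" the slide credits for the throughput gain.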
[Figure: mean IQ length (log scale) vs. normalized load under diagonal traffic; curves: MWM, Tassiulas]
Observations for Improvement
• Most of the weight of a matching is carried in a small number of edges
• Hence, remember edges not matchings
• We can have 100% throughput under all admissible traffic.
[Figure: mean IQ length (log scale) vs. normalized load under diagonal traffic; curves: MWM, R32M32, R1M1, Tassiulas]
Finer Observations
• Let M be the schedule used at time t
• Choose a “good” random matching R
• M’ = Merge(M,R)
• M’ includes the best edges from M and R
• Use M’ as the schedule at time t+1
• The above procedure yields an algorithm called LAURA
• There are many other small variations to this algorithm.
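The slides do not spell out Merge. One common way to realize "best edges from M and R", assumed here, is a cycle-wise merge: the union of two full matchings splits into alternating cycles, and on each cycle the heavier of the two sides is kept. The sketch assumes both matchings are permutations (`m[i]` is the output matched to input `i`) and that `q[i][j]` is the edge weight; the function name is hypothetical.

```python
def merge(m, r, q):
    """Cycle-wise merge of two full matchings m and r under weights q."""
    n = len(m)
    r_inv = [0] * n  # r_inv[j] = input that r connects to output j
    for i, j in enumerate(r):
        r_inv[j] = i
    merged = [None] * n
    seen = [False] * n
    for start in range(n):
        if seen[start]:
            continue
        # Follow input -> (m edge) -> output -> (r edge) -> input until closed.
        cycle, i = [], start
        while not seen[i]:
            seen[i] = True
            cycle.append(i)
            i = r_inv[m[i]]
        w_m = sum(q[k][m[k]] for k in cycle)
        w_r = sum(q[k][r[k]] for k in cycle)
        best = m if w_m >= w_r else r  # keep the heavier side of this cycle
        for k in cycle:
            merged[k] = best[k]
    return merged


q = [[5, 1, 0, 0], [1, 5, 0, 0], [0, 0, 1, 5], [0, 0, 5, 1]]
merged = merge([0, 1, 2, 3], [1, 0, 3, 2], q)
```

In this example both input matchings have weight 12, while the merged matching takes the heavy half of each and reaches weight 20. Within a cycle both sides use the same set of outputs, so the result is always a valid matching at least as heavy as either input.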
Merging Procedure
[Figure: merging two matchings X and R into M on a weighted bipartite graph. W(X)=12, W(R)=10, W(M)=13; annotated weight differences: 3-1+2-2=2 and 2-1+2-4=-1]
[Figure: mean IQ length (log scale) vs. normalized load under diagonal traffic; curves: MWM, M-LAURA, LAURA, iLQF, Tassiulas]
Recap: Two Successive Scaling Problems
OQ routers:
+ work-conserving (QoS)
- memory bandwidth = (N+1)R
IQ routers:
+ memory bandwidth = 2R
- arbitration complexity
[Figure: an OQ router and an IQ router with line rate R; arbitration in the IQ router is a bipartite matching]
IQ Arbitration Complexity
Today: 64 ports at 10Gbps, 64-byte cells.
• Arbitration Time = 64 bytes / 10 Gbps = 51.2 ns
• Request/Grant Communication BW = 17.5 Gbps
Scaling to 160Gbps:
• Arbitration Time = 3.2 ns
• Request/Grant Communication BW = 280 Gbps
Two main alternatives for scaling:
1. Increase cell size
2. Eliminate arbitration
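The arbitration times follow from the cell size and the line rate: the scheduler must produce one matching per cell time. The helper below is a small worked example with a hypothetical name; the request/grant bandwidth figures depend on a message format the slides do not give, so they are not reproduced.

```python
def arbitration_time_ns(cell_bytes, line_rate_gbps):
    """Cell time in nanoseconds: bits per cell divided by Gbit/s."""
    return cell_bytes * 8 / line_rate_gbps


today = arbitration_time_ns(64, 10)     # 512 bits at 10 Gbit/s
scaled = arbitration_time_ns(64, 160)   # same cell at 160 Gbit/s
print(f"{today:.1f} ns, {scaled:.1f} ns")
```

The same formula shows why a larger cell is one of the two scaling options: doubling the cell size doubles the time available for each arbitration decision.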
Desirable Characteristics for Router Architecture
Ideal: OQ
• 100% throughput
• Minimum delay
• Maintains packet order
Necessary: able to regularly connect any input to any output
What if the world was perfect? Assume Bernoulli iid uniform arrival traffic...
Round-Robin Scheduling
• Uniform & non-bursty traffic => 100% throughput
• Problem: traffic is non-uniform & bursty
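A round-robin schedule needs no arbitration at all, because the connection pattern depends only on the time slot. The sketch below assumes the cyclic-shift form, in which input i is connected to output (i + t) mod N in slot t; the function name is hypothetical.

```python
def round_robin_schedule(n, t):
    """Matching used in time slot t: input i -> output (i + t) mod n."""
    return [(i + t) % n for i in range(n)]


# Over N consecutive slots every input meets every output exactly once.
n = 4
visits = {(i, j): 0 for i in range(n) for j in range(n)}
for t in range(n):
    for i, j in enumerate(round_robin_schedule(n, t)):
        visits[(i, j)] += 1
```

Every VOQ is therefore served at rate 1/N, which matches the arrival rate only when traffic is uniform. This is the limitation the second bullet points out.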
Two-Stage Switch (I)
[Figure: external inputs 1..N, a first round-robin stage, internal inputs 1..N, a second round-robin stage, external outputs 1..N]
Two-Stage Switch (I)
[Figure: the same two-stage switch, with the first round-robin stage labeled as load balancing]
Two-Stage Switch Characteristics
• 100% throughput
• Problem: unbounded mis-sequencing
[Figure: two-stage switch with a cyclic shift in each stage; external inputs 1..N, internal inputs 1..N, external outputs 1..N; cells labeled 1 and 2 are shown in the switch]
Two-Stage Switch (II)
[Figure: new two-stage architecture. External inputs 1..N with a flow splitter and load balancer, VOQs, first-stage round-robin from external inputs to internal outputs, internal inputs 1..N with VOQs, second-stage round-robin to external outputs 1..N; flow F_ik travels from input i through an internal port j to output k]
New: N³ instead of N²
Expanding VOQ Structure
Solution: expand VOQ structure by distinguishing among switch inputs
[Figure: queues labeled 1, 2, 3 and a, b]
What is being done in practice (Cisco for example)
• They want schedulers that achieve 100% throughput and very low delay (like MWM)
• They want it to be as simple as iSLIP in terms of hardware implementation
• Is there any solution to this?
What is being done in practice (Cisco for example)
Company    Switching Capacity      Switch Architecture    Fabric Overspeed
Agere      40 Gbit/s-2.5 Tbit/s    Arbitrated crossbar    2x
AMCC       20-160 Gbit/s           Shared memory          1.0x
AMCC       40 Gbit/s-1.2 Tbit/s    Arbitrated crossbar    1-2x
Broadcom   40-640 Gbit/s           Buffered crossbar      1-4x
Cisco      40-320 Gbit/s           Arbitrated crossbar    2x