
Packet Multiple Access

Eytan Modiano
Massachusetts Institute of Technology


Multiple Access

• Shared transmission medium
– A receiver can hear multiple transmitters
– A transmitter can be heard by multiple receivers

• The major problem with multi-access is allocating the channel between the users; the nodes do not know when the other nodes have data to send
– Need to coordinate transmissions


Examples of Multiple Access Channels

• Local area networks (LANs)
– Traditional Ethernet
– Recent trend to non-multi-access LANs

• Satellite channels

• Multi-drop telephone

• Wireless radio

[Figure: protocol stack (NET, DLC, PHY), with the DLC layer split into LLC and MAC sublayers]

• Medium Access Control (MAC)
– Regulates access to the channel

• Logical Link Control (LLC)
– All other DLC functions


Approaches to Multiple Access

• Fixed assignment (TDMA, FDMA, CDMA)
– Each node is allocated a fixed fraction of bandwidth
– Equivalent to circuit switching
– Very inefficient for low duty factor traffic

• Contention systems
– Polling
– Reservations and scheduling
– Random access


Aloha

Single receiver, many transmitters

[Figure: many transmitters sharing one channel to a single receiver]

E.g., Satellite system, wireless


Slotted Aloha

• Time is divided into “slots” of one packet duration
– E.g., fixed size packets

• When a node has a packet to send, it waits until the start of the next slot to send it
– Requires synchronization

• If no other nodes attempt transmission during that slot, the transmission is successful
– Otherwise “collision”
– Collided packets are retransmitted after a random delay


Slotted Aloha Assumptions

• Poisson external arrivals

• No capture
– Packets involved in a collision are lost
– Capture models are also possible

• Immediate feedback
– Idle (0), Success (1), Collision (e)

• If a new packet arrives during a slot, transmit in the next slot

• If a transmission has a collision, the node becomes backlogged
– While backlogged, transmit in each slot with probability q_r until successful

• Infinite nodes where each arriving packet arrives at a new node
– Equivalent to no buffering at a node (queue size = 1)
– Pessimistic assumption gives a lower bound on Aloha performance


Markov chain for slotted Aloha

[Figure: Markov chain with states 0, 1, 2, 3, ... and transition probabilities such as P10, P03, P13, P34]

• The state n of the system is the number of backlogged nodes

p_{i,i-1} = prob. of one backlogged attempt and no new arrival

p_{i,i} = prob. of one new arrival and no backlogged attempts, or no new arrival and no success

p_{i,i+1} = prob. of one new arrival and one or more backlogged attempts

p_{i,i+j} = prob. of j new arrivals and one or more backlogged attempts, or j+1 new arrivals and no backlogged attempts

• Steady state probabilities do not exist
– Backlog tends to infinity => system unstable
– More later


Slotted Aloha

• Let g(n) be the attempt rate (the expected number of packets transmitted in a slot) in state n

g(n) = λ + n q_r

• The number of attempted packets per slot in state n is approximately a Poisson random variable of mean g(n)
– P(m attempts) = g(n)^m e^{-g(n)} / m!
– P(idle) = probability of no attempts in a slot = e^{-g(n)}
– P(success) = probability of one attempt in a slot = g(n) e^{-g(n)}
– P(collision) = P(two or more attempts) = 1 - P(idle) - P(success)
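As a quick numerical illustration of the Poisson approximation above, the following Python sketch evaluates the three slot outcomes for a given attempt rate and scans a grid of attempt rates for the one with the highest success probability. The function name `slot_probabilities` and the grid resolution are choices made for this example, not part of the original slides.

```python
import math

def slot_probabilities(g):
    """Return (P_idle, P_success, P_collision) for Poisson attempts of mean g."""
    p_idle = math.exp(-g)
    p_success = g * math.exp(-g)
    return p_idle, p_success, 1.0 - p_idle - p_success

# Scan attempt rates 0.001, 0.002, ..., 3.000 and keep the one with the
# highest success probability.
grid = [i / 1000 for i in range(1, 3001)]
best_g = max(grid, key=lambda g: slot_probabilities(g)[1])

print("best attempt rate on the grid:", best_g)
print("idle, success, collision at that rate:", slot_probabilities(best_g))
```

On this grid the success probability peaks at g = 1, where idle and success slots each occur with probability 1/e.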


Throughput of Slotted Aloha

• The throughput is the fraction of slots that contain a successful transmission = P(success) = g(n) e^{-g(n)}
– When the system is stable, throughput must also equal the external arrival rate (λ)
– What value of g(n) maximizes throughput?
– g(n) < 1 => too many idle slots
– g(n) > 1 => too many collisions
– If g(n) can be kept close to 1, an external arrival rate of 1/e packets per slot can be sustained

d/dg(n) [g(n) e^{-g(n)}] = e^{-g(n)} - g(n) e^{-g(n)} = 0

=> g(n) = 1

=> P(success) = g(n) e^{-g(n)} = 1/e ≈ 0.368


Instability of slotted aloha

• If the backlog increases beyond the unstable point (bad luck), then it tends to increase without limit and the departure rate drops to 0

• The drift in state n, D(n), is the expected change in backlog over one time slot
– D(n) = λ - P(success) = λ - g(n) e^{-g(n)}
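The drift expression can be explored numerically. The sketch below, in Python, evaluates D(n) with g(n) = λ + n q_r and searches for the backlog beyond the throughput peak at which the drift turns positive again, which is one way to locate the unstable point. The function names and the sample values λ = 0.3, q_r = 0.05 are assumptions made for illustration.

```python
import math

def drift(n, lam, q_r):
    """Drift D(n) = lam - g(n) * exp(-g(n)), with g(n) = lam + n * q_r."""
    g = lam + n * q_r
    return lam - g * math.exp(-g)

def unstable_point(lam, q_r):
    """Smallest backlog past the throughput peak (g(n) >= 1) with positive drift."""
    if not 0 < lam < 1 / math.e:
        raise ValueError("this sketch assumes 0 < lam < 1/e")
    n = 0
    while lam + n * q_r < 1.0:        # move to the right of the peak at g = 1
        n += 1
    while drift(n, lam, q_r) <= 0:    # departures still exceed arrivals here
        n += 1
    return n

print(unstable_point(0.3, 0.05))
```

A smaller q_r pushes this point to a larger backlog, matching the observation that small retransmission probabilities delay the onset of instability.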


Stabilizing slotted aloha

• Choosing q_r small increases the backlog at which instability occurs (since g(n) = λ + n q_r), but also increases delay (since the mean retry time is 1/q_r)

• Solution: estimate the backlog n from past feedback
– Given the backlog estimate, choose q_r to keep g(n) = 1
– Assume all arrivals are immediately backlogged: g(n) = n q_r, P(success) = n q_r (1 - q_r)^{n-1}
– To maximize P(success), choose q_r = min{1, 1/n}
– When the estimate of n is perfect, idles occur with probability 1/e, successes with probability 1/e, and collisions with probability 1 - 2/e
– When the estimate is too large, too many idle slots occur
– When the estimate is too small, too many collisions occur

• Nodes can use the feedback information (0, 1, e) to make estimates
– A good rule is to increase the estimate of n on each collision, and to decrease it on each idle slot or successful slot
– Note that the increase on a collision should be (e-2)^{-1} times as large as the decrease on an idle slot


Stabilized slotted Aloha

• Assume all arrivals are immediately backlogged
– g(n) = n q_r = attempt rate
– P(success) = n q_r (1 - q_r)^{n-1}
– For max throughput set g(n) = 1 => q_r = min{1, 1/n'}, where n' is the estimate of n
– Let n_k = estimate of the backlog after the kth slot:

n_{k+1} = max{λ, n_k + λ - 1}      on idle or success
n_{k+1} = n_k + λ + (e-2)^{-1}     on collision

– Can be shown to be stable for λ < 1/e
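The update rule above lends itself to a small simulation. The following Python sketch assumes Poisson arrivals of rate λ per slot, treats every arrival as immediately backlogged, and lets each backlogged packet transmit with probability min{1, 1/n_k}. The initial estimate, the seed, and the helper `poisson_sample` are choices made for this example only.

```python
import math
import random

def poisson_sample(rng, mean):
    """Draw a Poisson variate with Knuth's multiplication method (small means)."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate(lam, slots, seed=1):
    """Return (throughput, final backlog) for stabilized slotted Aloha."""
    rng = random.Random(seed)
    backlog = 0        # true backlog, not known to the nodes
    estimate = lam     # n_k, the shared estimate (assumed starting value)
    successes = 0
    for _ in range(slots):
        q_r = min(1.0, 1.0 / estimate)
        attempts = sum(rng.random() < q_r for _ in range(backlog))
        if attempts == 1:
            backlog -= 1
            successes += 1
        if attempts <= 1:   # feedback 0 (idle) or 1 (success)
            estimate = max(lam, estimate + lam - 1.0)
        else:               # feedback e (collision)
            estimate = estimate + lam + 1.0 / (math.e - 2.0)
        backlog += poisson_sample(rng, lam)   # arrivals join the backlog
    return successes / slots, backlog

throughput, final_backlog = simulate(lam=0.3, slots=20000)
print(throughput, final_backlog)
```

For an arrival rate below 1/e, the measured throughput should track λ and the backlog should stay small; rerunning with λ above 1/e shows the backlog growing instead.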


TDM vs. slotted aloha

• Aloha achieves lower delays when arrival rates are low

• TDM results in very large delays with a large number of users, while Aloha is independent of the number of users

[Figure: delay versus arrival rate for slotted Aloha and for TDM with m = 8 and m = 16]


Pure (unslotted) Aloha

• New arrivals are transmitted immediately (no slots)
– No need for synchronization
– No need for fixed length packets

• A backlogged packet is retried after an exponentially distributed random delay with some mean 1/x

• The total arrival process is a time-varying Poisson process of rate g(n) = λ + nx (n = backlog, 1/x = average time between retransmissions)

• Note that an attempt suffers a collision if the previous attempt is not yet finished (t_i - t_{i-1} < 1) or the next attempt starts too soon (t_{i+1} - t_i < 1)

[Figure: timeline of attempt times t_1, ..., t_5 showing new arrivals, retransmissions, and a collision]


Throughput of Unslotted Aloha

• An attempt is successful if the inter-attempt intervals on both sides exceed 1 (for unit duration packets)
– P(success) = e^{-g(n)} e^{-g(n)} = e^{-2g(n)}
– Throughput (success rate) = g(n) e^{-2g(n)}
– Max throughput occurs at g(n) = 1/2: throughput = 1/(2e) ≈ 0.18
– Stabilization issues are similar to slotted Aloha
– Advantages of unslotted Aloha are simplicity and the possibility of unequal length packets
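The two-sided success condition can be checked with a short Monte Carlo experiment. This Python sketch draws attempt times from a Poisson process of constant rate g (a simplification: the slides let the rate vary with the backlog), counts the attempts whose gaps on both sides exceed one packet time, and compares the result with g e^{-2g}. The function name, horizon, and seed are assumptions of the example.

```python
import math
import random

def simulate_pure_aloha(g, horizon, seed=1):
    """Estimate the success rate per unit time for Poisson attempts of rate g."""
    rng = random.Random(seed)
    times = []
    t = rng.expovariate(g)
    while t < horizon:
        times.append(t)
        t += rng.expovariate(g)
    successes = 0
    for i, t_i in enumerate(times):
        gap_before = t_i - times[i - 1] if i > 0 else float("inf")
        gap_after = times[i + 1] - t_i if i + 1 < len(times) else float("inf")
        # Unit-length packets: both neighbours must be more than 1 away.
        if gap_before > 1.0 and gap_after > 1.0:
            successes += 1
    return successes / horizon

g = 0.5
print(simulate_pure_aloha(g, 200000.0), g * math.exp(-2.0 * g))
```

At g = 1/2 both numbers should be close to 1/(2e), about 0.18.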


Splitting Algorithms

• More efficient approach to resolving collisions
– Simple feedback (0, 1, e)
– Basic idea: assume only two packets are involved in a collision
  Suppose all other nodes remain quiet until the collision is resolved, and the nodes in the collision each transmit with probability 1/2 until one is successful
  On the next slot after this success, the other node transmits
  The expected number of slots for the first success is 2, so the expected number of slots to transmit 2 packets is 3
  Throughput over the 3 slots = 2/3
– In practice the above algorithm cannot really work
  Cannot assume only two users are involved in a collision
  A practical algorithm must allow for collisions involving an unknown number of users


Tree algorithms

• After a collision, all new arrivals and all backlogged packets not in the collision wait

• Each colliding packet randomly joins one of two groups (Left and Right groups)
– Toss of a fair coin
– Left group transmits during the next slot while the Right group waits
  If a collision occurs, the Left group splits again (stack algorithm)
  The Right group waits until the Left collision is resolved
– When the Left group is done, the Right group transmits

[Figure: splitting tree for a collision among packets (1,2,3,4): subsets (1,2,3) and 4, then 1 and (2,3), then an idle slot and (2,3), then 2 and 3; each node is labeled collision, idle, or success]

Notice that after the idle slot, the collision between (2,3) was sure to happen and could have been avoided

Many variations and improvements on the original tree splitting algorithm
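The basic tree algorithm described above can be sketched as a short recursive routine. In this Python illustration each call represents one slot: an empty group is an idle slot, a single packet is a success, and a larger group is a collision that splits by fair coin tosses, with the left subgroup resolved before the right one. The function name `resolve` and the log format are hypothetical.

```python
import random

def resolve(packets, rng, log):
    """Use one slot for `packets`; on a collision, split and resolve left then right."""
    if len(packets) == 0:
        log.append(("idle", ()))
        return
    if len(packets) == 1:
        log.append(("success", tuple(packets)))
        return
    log.append(("collision", tuple(packets)))
    left, right = [], []
    for packet in packets:
        # Fair coin toss decides which subgroup each colliding packet joins.
        (left if rng.random() < 0.5 else right).append(packet)
    resolve(left, rng, log)    # the right group waits until the left is done
    resolve(right, rng, log)

log = []
resolve([1, 2, 3, 4], random.Random(0), log)
for outcome, group in log:
    print(outcome, group)
```

This plain version also spends a slot on the collision that is certain to follow an idle left group, which is the inefficiency the slide points out.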


Throughput comparison

• Stabilized pure Aloha: T = 0.184 (= 1/(2e))

• Stabilized slotted Aloha: T = 0.368 (= 1/e)

• Basic tree algorithm: T = 0.434

• Best known variation on tree algorithm: T = 0.4878

• Upper bound on any collision resolution algorithm with (0, 1, e) feedback: T <= 0.568

• TDM achieves throughputs up to 1 packet per slot, but the delay increases linearly with the number of nodes


Carrier Sense Multiple Access (CSMA)

• In certain situations nodes can hear each other by listening to the channel: “carrier sensing”

• CSMA: polite version of Aloha
– Nodes listen to the channel before they start transmission
  Channel idle => transmit
  Channel busy => wait (join backlog)
– When do backlogged nodes transmit?
  When the channel becomes idle, backlogged nodes attempt transmission with probability q_r
  Persistent protocol: q_r = 1
  Non-persistent protocol: q_r < 1


CSMA

• Let τ = the maximum propagation delay on the channel
– When a node starts/stops transmitting, it will take this long for all nodes to detect channel busy/idle

• For initial understanding, view the system as slotted with “mini-slots” of duration equal to the maximum propagation delay
– Normalize the mini-slot duration to β = τ/D_Tp and the packet duration to 1

• Actual systems are not slotted, but this hypothetical system simplifies the analysis and understanding of CSMA

[Figure: timeline of mini-slots of duration β followed by a packet of duration 1]


Rules for slotted CSMA

• When a new packet arrives
– If the current mini-slot is idle, start transmitting in the next mini-slot
– If the current mini-slot is busy, the node joins the backlog
– If a collision occurs, the nodes involved in the collision become backlogged

• Backlogged nodes attempt transmission after an idle mini-slot with probability q_r < 1 (non-persistent)
– Transmission attempts only follow an idle mini-slot
– Each “busy period” (success or collision) is followed by an idle slot before a new transmission can begin

• Time can be divided into epochs:
– A successful packet followed by an idle mini-slot (duration = β+1)
– A collision followed by an idle mini-slot (duration = β+1)
– An idle mini-slot (duration = β)


Analysis of CSMA

• Let the state of the system be the number of backlogged nodes

• Let the state transition times be the ends of idle slots
– Let T(n) = average amount of time between state transitions when the system is in state n:
  T(n) = β + (1 - e^{-λβ} (1 - q_r)^n)
  When q_r is small, (1 - q_r)^n ≈ e^{-q_r n} => T(n) ≈ β + (1 - e^{-λβ - n q_r})

• At the beginning of each epoch, each backlogged node transmits with probability q_r

• New arrivals during the previous idle slot are also transmitted

• With backlog n, the number of packets that attempt transmission at the beginning of an epoch is approximately Poisson with rate

g(n) = λβ + n q_r


Analysis of CSMA

• The probability of success (per epoch) is

P_s = g(n) e^{-g(n)}

• The expected duration of an epoch is approximately

T(n) ≈ β + (1 - e^{-g(n)})

• Thus the success rate per unit time is

λ < departure rate = g(n) e^{-g(n)} / (β + 1 - e^{-g(n)})


Maximum Throughput for CSMA

• The optimal value of g(n) can again be obtained:

g(n) ≈ √(2β),   λ < 1/(1 + √(2β))

• Tradeoff between idle slots and time wasted on collisions

• High throughput when β is small

• Stability issues similar to Aloha (less critical)

[Figure: departure rate versus g(n) = λβ + n q_r, peaking near g(n) = √(2β) at about 1/(1 + √(2β)), with the arrival rate shown for comparison]
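To see how good the small-β approximation is, the following Python sketch maximizes the departure rate g e^{-g} / (β + 1 - e^{-g}) over a grid of attempt rates and compares the result with √(2β) and 1/(1 + √(2β)). The function names, grid step, and the sample value β = 0.01 are assumptions made for this example.

```python
import math

def csma_departure_rate(g, beta):
    """Success rate per unit time for slotted CSMA with attempt rate g."""
    return g * math.exp(-g) / (beta + 1.0 - math.exp(-g))

def best_attempt_rate(beta, step=0.001, g_max=3.0):
    """Grid search for the attempt rate that maximizes the departure rate."""
    grid = [i * step for i in range(1, int(g_max / step) + 1)]
    return max(grid, key=lambda g: csma_departure_rate(g, beta))

beta = 0.01
g_star = best_attempt_rate(beta)
print("numerical optimum:", g_star, csma_departure_rate(g_star, beta))
print("approximation:    ", math.sqrt(2 * beta), 1 / (1 + math.sqrt(2 * beta)))
```

The two pairs of numbers agree to within a few percent for small β, and the gap widens as β grows.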


Unslotted CSMA

• Slotted CSMA is not practical
– Difficult to maintain synchronization
– Mini-slots are useful for understanding but not critical to the performance of CSMA

• Unslotted CSMA will have slightly lower throughput due to increased probability of collision

• Unslotted CSMA has a smaller effective value of β than slotted CSMA
– Essentially β becomes the average instead of the maximum propagation delay


CSMA/CD

• CSMA with Collision Detection (CD) capability
– Nodes are able to detect collisions
– Upon detection of a collision, nodes stop transmission
  Reduces the amount of time wasted on collisions

• Protocol:
– All nodes listen to transmissions on the channel
– When a node has a packet to send:
  Channel idle => transmit
  Channel busy => wait a random delay (binary exponential backoff)
– If a transmitting node detects a collision, it stops transmission
  Waits a random delay and tries again

[Figure: workstations (WS) attached to a two-way cable]


Time to detect collisions

• A collision can occur while the signal propagates between the two nodes

• It would take an additional propagation delay for both users to detect the collision and stop transmitting

• If τ is the maximum propagation delay on the cable, then if a collision occurs, it can take up to 2τ seconds for all nodes involved in the collision to detect it and stop transmission

[Figure: two workstations (WS) separated by propagation delay τ]


Approximate model for CSMA/CD

• Simplified approximation for added insight

• Consider a slotted system with “mini-slots” of duration 2τ

• If a node starts transmission at the beginning of a mini-slot, by the end of the mini-slot either
– No collision occurred and the rest of the transmission will be uninterrupted
– A collision occurred, but by the end of the mini-slot the channel would be idle again

• Hence a collision affects at most one mini-slot

[Figure: timeline of mini-slots of duration 2τ followed by a packet of duration 1]


Analysis of CSMA/CD

• Assume N users and that each attempts transmission during a free “mini-slot” with probability p
– p includes new arrivals and retransmissions

P(i users attempt) = binom(N, i) p^i (1 - p)^{N-i}

P(exactly 1 attempt) = P(success) = N p (1 - p)^{N-1}

To maximize P(success),

d/dp [N p (1 - p)^{N-1}] = N (1 - p)^{N-1} - N (N-1) p (1 - p)^{N-2} = 0

=> p_opt = 1/N

=> Average attempt rate of one per slot

=> Notice the similarity to slotted Aloha


Analysis of CSMA/CD, continued

• Once a mini-slot has been successfully captured, transmission continues without interruption

• New transmission attempts will begin at the next mini-slot after the end of the current packet transmission

P(success) = N p (1 - p)^{N-1} = (1 - 1/N)^{N-1}   (with p = 1/N)

P_s = lim_{N→∞} P(success) = 1/e

Let X = number of mini-slots per successful transmission

P(X = i) = (1 - P_s)^{i-1} P_s

=> E[X] = 1/P_s = e
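The optimization and the limit above can be checked numerically. The Python sketch below searches a grid for the attempt probability that maximizes N p (1 - p)^{N-1} and then evaluates the success probability at p = 1/N for growing N. The function name `p_success`, the choice N = 50, and the grid resolution are assumptions of this example.

```python
import math

def p_success(n_users, p):
    """Probability that exactly one of n_users attempts in a mini-slot."""
    return n_users * p * (1.0 - p) ** (n_users - 1)

# Grid search over p for a fixed number of users.
n_users = 50
grid = [i / 10000 for i in range(1, 10000)]
best_p = max(grid, key=lambda p: p_success(n_users, p))
print("best p:", best_p, "versus 1/N:", 1 / n_users)

# With p = 1/N the success probability approaches 1/e from above,
# so the mean number of mini-slots per success, 1/P_s, approaches e.
for n in (10, 100, 1000):
    ps = p_success(n, 1.0 / n)
    print(n, ps, 1.0 / ps)
```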


Analysis of CSMA/CD, continued

• Let S = average amount of time between successful packet transmissions

S = (e-1) 2τ + D_Tp + τ

where (e-1) 2τ is the time spent in idle/collision mini-slots, D_Tp is the packet transmission time, and τ is the average time until the start of the next mini-slot

• Efficiency = D_Tp / S = D_Tp / (D_Tp + τ + 2τ(e-1))

• Let β = τ / D_Tp => Efficiency ≈ 1/(1 + 4.4β), so λ < 1/(1 + 4.4β)

• Compare to CSMA without CD, where

λ < 1/(1 + √(2β))


Notes on CSMA/CD

• Can be viewed as a reservation system where the mini-slots are used for making reservations for data slots

• In this case, Aloha is used for making reservations during the mini-slots

• Once a user captures a mini-slot, it continues to transmit without interruption

• In practice, of course, there are no mini-slots
– Minimal impact on performance, but the analysis is more complex


CSMA/CD examples

• Example (Ethernet)
– Transmission rate = 10 Mbps
– Packet length = 1000 bits, D_Tp = 10^{-4} sec
– Cable distance = 1 mile, τ = 5x10^{-6} sec
– => β = 5x10^{-2} and E ≈ 80%

• Example (GEO satellite): propagation delay 1/4 second
– β = 2,500 and E ≈ 0%

• CSMA/CD only suitable for short propagation scenarios!

• How is Ethernet extended to 100 Mbps?

• How is Ethernet extended to 1 Gbps?
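The two examples can be reproduced with a few lines of Python. The sketch below uses the efficiency expression D_Tp / (D_Tp + τ + 2τ(e-1)) from the analysis, and assumes the same 10^{-4} sec packet time for the satellite case, which is what β = 2,500 implies. The function name is hypothetical.

```python
import math

def csma_cd_efficiency(tau, packet_time):
    """Efficiency D_Tp / (D_Tp + tau + 2*tau*(e - 1)), written in terms of beta."""
    beta = tau / packet_time
    return 1.0 / (1.0 + (1.0 + 2.0 * (math.e - 1.0)) * beta)

packet_time = 1000 / 10e6   # 1000 bits at 10 Mbps = 1e-4 sec

ethernet = csma_cd_efficiency(tau=5e-6, packet_time=packet_time)
satellite = csma_cd_efficiency(tau=0.25, packet_time=packet_time)

print("Ethernet efficiency:", ethernet)
print("GEO satellite efficiency:", satellite)
```

Since β scales with the transmission rate for a fixed packet length and cable length, raising the rate lowers the efficiency unless packets get longer or cables get shorter, which is the issue behind the two questions above.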


Token rings

• Token rings were developed by IBM in the early 1980s

• Token: a bit sequence
– Token circulates around the ring
  Busy token: 01111111
  Free token: 01111110

• When a node wants to transmit
– Wait for a free token
– Remove the token from the ring (replace it with a busy token)
– Transmit the message
– When done transmitting, put the free token back on the ring
– Nodes must buffer 1 bit of data so that a free token can be changed to a busy token

• Token ring is basically a polling system: the token does the polling

[Figure: token ring]


TOKEN BUSES

• Special control packet serves as a token
• Nodes must have the token to transmit
• Token is passed from node to node in some order
– Conceptually, a token bus is the same as a token ring
– When one node finishes transmission, it sends an idle token to the next node (by addressing the control packet properly)
– Similar to a polling system

• Issues
– Efficiency is lower than for token rings, due to longer transmission delay for the packets and longer propagation delays
– Need a protocol for joining and leaving the bus

[Figure: workstations (WS) attached to a bus]


Large propagation delay (satellite networks)

• Satellite reservation system
– Use mini-slots to make reservations for longer data slots
– Mini-slot access can be inefficient (Aloha, TDMA, etc.)

• A crude approximation: delay is 3/2 times the propagation delay plus the ideal queueing delay

[Figure: frame structure alternating reservation intervals (mini-slots 1 to 5, labeled A = mv) and data intervals; a packet's timeline runs from arrival, through the wait for the reservation interval, the propagation delay, and the wait for the assigned data slot, to transmission]


Satellite Reservations

• Frame length must exceed the round-trip delay
– Reservation slots during frame j are used to reserve data slots in frame j+1
– Variable length: serve all requests from frame j in frame j+1
  Difficult to maintain synchronization
  Difficult to provide QoS (e.g., support voice traffic)
– Fixed length: maintain a virtual queue of requests

• Reservation mechanism
– Scheduler on board the satellite
– Scheduler on the ground
– Distributed queue algorithm
  All nodes keep track of reservation requests and use the same algorithm to make reservations

• Control channel access
– TDMA: simple, but difficult to add more users
– Aloha: can support a large number of users, but collision resolution can be difficult and add enormous delay


Aloha Reservations

• Use Aloha to capture a slot

• After capturing a slot, the user keeps the slot until done
– Other users observe the slot busy and don’t attempt

• When done, other users can go after the slot
– Other users observe the slot idle and attempt using Aloha

• Method useful for long data transfers or for mixed voice and data


Packet multiple access summary

• Latency: ratio of propagation delay to packet transmission time
– GEO satellite example: D_p = 0.5 sec, packet length = 1000 bits, R = 1 Mbps
  Latency = 500 => very high
– LEO satellite example: D_p = 0.1 sec
  Latency = 100 => still very high
– Over satellite channels the data rate must be very low to be in a low latency environment

• Low latency protocols
– CSMA, polling, token rings, etc.
– Throughput ≈ 1/(1 + aα), α = latency, a = constant

• High latency protocols
– Aloha is insensitive to latency, but generally has low throughput
  Very little delay
– Reservation systems can achieve high throughput
  Delays for making reservations
– Protocols can be designed to be a hybrid of Aloha and reservations
  Aloha at low loads, reservations at high loads

MIT

Switches, Routers and Networks


Overview

• Introduction
• Routing and switching:
– Switch fabrics
– Basics of switching
– Blocking
– Interconnection examples
– Complexity
– Recursive constructions
• Interconnection routing
• Buffering - input and output
• Local area networks (LANs)
• Metropolitan area networks (MANs)
• Wide area networks (WANs)
• Trends


Introduction

• Data networks generally evolve fairly independently for different applications and are then patched together: telephony, a variety of computer applications, wireless applications

• IP is a large portion of the traffic, but it is carried by a variety of protocols throughout the network

• Voice is still the application that has determined many of the implementation issues, but its share is decreasing and voice is increasingly carried over IP (voice over IP)

• Voice-oriented networks are not very flexible, but are very robust

• IP is very successful because it is very flexible, but increasingly there is a drive towards enhancing the reliability of services

• How do all of these network types and requirements fit together?


Networks

[Figure: hierarchy of networks, with LANs and a SAN attached to MANs, which attach to a WAN]

• LANs serve a wide variety of services and attach to MANs or maybe directly to WANs

• The two main purposes of a network are:
– Transmission across some distance: this involves amplification or regeneration (generally code-assisted)
– The establishment of variable flows: switching and routing


Switching and Routing

• Switching is generally the establishment of connections on a circuit basis

• Routing is generally the forwarding of traffic on a datagram basis

• Routing requires switching but not vice versa: routing uses connections that are permanently or temporarily set up in order to forward datagrams (those datagrams may be in circuit form, for instance VPs and VCs)


Packet routers

• A packet switch consists of a routing engine (table look-up), a switch scheduler, and a switch fabric

• The routing engine looks up the packet address in a routing table and determines which output port to send the packet to
– Packet is tagged with the port number
– The switch uses the tag to send the packet to the proper output port


Switch fabrics

• Simplest switch fabric is simply a shared bus
– Most of the processing is done in line cards (LCs)
  Route table look-up
  Line cards buffer the packets
  Line cards send packets to the proper output
– Bus bandwidth must be N times the LC speed (N ports)

• In general a switch fabric replaces the bus

• Switch fabrics are created from certain building blocks of smaller switches arranged in stages

• Simplest switch is a 2x2 switch, which can be either in the through or crossed position

[Figure: computer and line cards (LC) attached to a shared bus]


Definitions

• A connection state is a mapping from the array of inputs to that of outputs; connections are either point-to-point or multicast

• Basic switch building blocks are:
– the distributor
– the concentrator
– the 2x2 2-state point-to-point switch (switching cell)

[Figure: connection states of the distributor, the concentrator, and the 2x2 switching cell, with inputs and outputs labeled 0 and 1]


Building up

• Interconnection network: a finite collection of nodes together with a set of interconnection lines such that
– every node is an object with an array of inputs and an array of outputs
– an interconnection line leads from an output of one node to an input of another node
– every I/O of a node is incident with at most one interconnection line
– an I/O is called external if it is not incident with any interconnection line

• A route from an external input to an external output is a chain of distinct I/Os (a_0, b_0, a_1, b_1, …, a_k, b_k) where a_0 and b_k are external and b_{j-1} is interconnected to a_j


Building up

• An interconnection network is called a switching network when:
– every node qualifies to be a switch through proper specification of connection states
– the network is routable (there exists a route from every external input to every external output)
– an ordering is specified on external inputs and on external outputs

• Unique routing interconnection networks: all routes from an external input to an external output are parallel, that is, (a_0, b_0, a_1, b_1, …, a_k, b_k) and (a_0, b'_0, a'_1, b'_1, …, a'_k, b_k) are such that a_j, a'_j reside on the same node and b_j, b'_j reside on the same node

• Otherwise: alternate routing


Blocking

• An mxn unique routing network is called a nonblocking network if for any integer k < min(m,n)+1, any k external inputs, any k external outputs and any pairing between these external I/Os, there exist k disjoint routes for the matched pairs

• For a routable network, the same property is that of a rearrangeably nonblocking, or rearrangeable, network

• An interconnection network is strictly non-blocking if requests for routes are always granted under the rule of arbitrary route selection, and wide-sense non-blocking if there exists an algorithm for route selection that grants all requests

[Figure: nested classes, with non-blocking inside wide-sense (WS) non-blocking inside rearrangeable]


Blocking, Multi-stage networks

• The main connection between rearrangeability and the non-blocking property is given by the following theorem:

A switching network composed of non-blocking switches is rearrangeable iff it constructs a non-blocking switch

• A common means of building interconnection networks is to use a multi-stage architecture:
– every interconnection line is between two stages
– every external input is on a first-stage node
– every external output is on a final-stage node
– nodes within each stage are linearly ordered


Interconnection networks

• N inputs, log2(N) stages with N/2 modules per stage
Example: Omega (shuffle exchange) network

• Notice the order of inputs into a stage is a shuffle of the outputs from the previous stage: (0,4,1,5,2,6,3,7)

• Easily extended to more stages

• Any output can be reached from any input by proper switch settings
– Not all routes can be done simultaneously
– Exactly one route between each OD pair


Interconnection networks

• Another example of a multi-stage interconnection network
• Built using the basic 2x2 switch module
• Recursive construction
– Construct an N by N switch using two N/2 by N/2 switches and a new stage of N/2 basic (2x2) modules
– An N by N switch has log2(N) stages, each with N/2 basic (2x2) modules


Complexity issues

• There are many different parameters that are used to consider the complexity of an interconnection network
• Line complexity: number of interconnection lines
• Node (cell) complexity: number of small nodes (mxn where m < 3 and n < 3)
• Depth: maximum number of nodes on a route (assuming an acyclic interconnection network)
• Entropy of a switch: log of the number of connection states
• What relations exist between complexity and the capabilities of a switch?


Complexity

• The depth of an mxn routable interconnection network is at least max(log(m), log(n))

• Proof: for a depth d, at most 2^d external outputs can be reached from a given external input. Since we have routability, n < 2^d + 1 and m < 2^d + 1.

• When a switching network is composed of 2-state switches, the component complexity of the network is at least the entropy of the switch

• Proof: for E the number of switches, there are 2^E ways to form a combination of one connection state in every node. Each combination corresponds to at most one connection state of the network.


Complexity

• When an nxn rearrangeable network is composed of small nodes, its component complexity is at least log(n!)

• Proof: if we take every small node to be replaced by a 2-state point-to-point switch, then we have a non-blocking switch. Thus, there is a different connection state for every one of the n! one-to-one mappings between the n inputs and the n outputs. We now use the relation for networks composed of 2-state switches.

• Note: using Stirling’s formula, we can obtain a simple approximate bound for component complexity


Complexity

• Component complexity:

• Relation between line and component complexity: component complexity +mn = line complexity +m + n


Complexity

• If an mxn nonblocking network is composed of n_12 1x2 nodes, n_21 2x1 nodes, n_22 cells, plus possibly crosspoints (edges), then

n_12 + n_21 + 4 n_22 = 2mn - m - n

• Corollary: an nxn non-blocking network composed of small nodes has component complexity at least 0.5(n^2 - n)

• Note: directed acyclic graphs can be seen as a special case of a network: a crosspoint network

• We have basic complexity properties, but how do we build networks?


Recursive 2-stage construction

• A 2-stage interconnection with parameters m and n is composed of n mxm input nodes and m nxn output nodes interconnected by a coordinate interchange (static)

• Constructions using trees (divide and conquer):

[Figure: two construction trees. 16x16 splits into 4x4 and 4x4, each of which splits into 2x2 and 2x2. 60x60 splits into 6x6 and 10x10, which split into 2x2, 3x3 and 5x5, 2x2.]

• Basic blocks need not be 2x2, trees need not be balanced


Benes approach

• A three-stage approach in which we use as the middle stage two networks of size 2^{n-1} x 2^{n-1} to build a network of size 2^n x 2^n

[Figure: two 2^{n-1} x 2^{n-1} middle networks between an input stage of 2^{n-1} cells and an output stage of 2^{n-1} cells]


Generalized 3-stage approach

• We denote by [nxm, rxp, mxq] the 3-stage network with r nxm input nodes, m rxp middle nodes, and p mxq output nodes such that
– output y of input node x is linked to input x of middle node y
– output u of middle node y is linked to input y of output node u

• Rearrangeability theorem: the 3-stage network is rearrangeable iff

• It is strictly non-blocking iff


Maximum matchings

• Algorithms for finding a maximum matching exist
• The best known algorithm takes O(N^2.5) operations
– Too long for large N
• Alternatives
– Sub-optimal solutions
– Maximal matching: a matching that cannot be made any larger for a given backlog matrix
– For the previous example:
  (1-1, 3-3) is maximal
  (2-1, 1-2, 3-3) is maximum
• Fact: the number of edges in a maximal matching is ≥ 1/2 the number of edges in a maximum matching


Self-routing

• Use the switch fabric for packet routing
• Use a tag: an n-bit sequence with one bit per stage of the network
– E.g., tag = b3 b2 b1
• Module at stage i looks at bit i of the tag (b_i), and sends the packet up if b_i = 0 and down if b_i = 1
• In the omega network, for a destination port with binary address abc the tag is cba
– Example: output 100 => tag = 001
– Notice that regardless of input port, tag 001 will get you to output 100
• What happens when packets cannot be forwarded to the right output for the given setting of the switching fabric?
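Self-routing in the omega network can be sketched in Python as follows. The sketch assumes the usual model in which each stage is a perfect shuffle (a left rotation of the n-bit line address) followed by 2x2 modules whose upper output is the even line and lower output the odd line. Stage i uses tag bit b_i, which is the i-th most significant bit of the destination address. The function name `omega_route` is hypothetical.

```python
def omega_route(src, dst, n_bits):
    """Follow a packet from input line src; return (final line, line after each stage)."""
    mask = (1 << n_bits) - 1
    pos = src
    path = []
    for stage in range(1, n_bits + 1):
        # Perfect shuffle wiring: rotate the n-bit line address left by one.
        pos = ((pos << 1) | (pos >> (n_bits - 1))) & mask
        # Tag bit b_stage: destination bits are consumed most significant first.
        bit = (dst >> (n_bits - stage)) & 1
        # The 2x2 module sends the packet up (even line) or down (odd line).
        pos = (pos & ~1) | bit
        path.append(pos)
    return pos, path

for src in range(8):
    final, path = omega_route(src, 0b100, 3)
    print(src, path, final)
```

Every input reaches output 100 with the same tag, because after n stages the source bits have all been shifted out and replaced by destination bits.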


Interconnection analysis for routing

• Assume no buffering at the switches
• If two packets want to use the same port, one of them is dropped
• Suppose the switch has m stages
• Packet transmit time = 1 slot (between stages)
• New packet arrival at the inputs, every slot
– Saturation analysis (for maximum throughput)
– Uniform destination distribution, independent from packet to packet


Interconnection throughput

• Let P(m) be the probability that a packet is transmitted on a stage m link, P(0) = 1

• P(m+1) = 1 - P(no packet on stage m+1 link (link c))
         = 1 - P(neither input to stage m+1 chooses this output)

• Each input has a packet with probability P(m) and that packet will choose the link with probability 1/2. Hence,

P(m+1) = 1 - (1 - P(m)/2)^2

• We can now solve for P(m) recursively

• For an m stage network, throughput (per output link) is P(m), which is the probability that there is a packet at the output
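The recursion is easy to iterate. The short Python sketch below starts from P(0) = 1 and applies P(m+1) = 1 - (1 - P(m)/2)^2 stage by stage; the function name `stage_throughput` is an assumption of this example.

```python
def stage_throughput(stages):
    """Return [P(0), P(1), ..., P(stages)] for an unbuffered network of 2x2 modules."""
    p = 1.0
    history = [p]
    for _ in range(stages):
        p = 1.0 - (1.0 - p / 2.0) ** 2
        history.append(p)
    return history

for m, p in enumerate(stage_throughput(10)):
    print(m, round(p, 4))
```

The values fall steadily with the number of stages, which is the loss that buffering is meant to recover.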


Distributed buffer

• Modular architecture
– Switch fabric consists of many 2x2 modules

• Switch buffers: none, at input, or at output of each module


Contention and buffering

• Two packets may want to use the same link at the same time (same output port of a module): hot spot effect

• Solution: buffering

[Figure: throughput of the interconnect network versus number of stages (1 to 20), with the throughput axis running from 0 to 1.2]


Multi-stage architecture

• Throughput is significantly improved by buffers at the stages
– Buffers increase delay
– Tradeoff between delay and throughput

• Advantages: modular, scalable, bus (links) only needs to be as fast as the line cards

• Disadvantages
– Delays for going through the stages
  Cut-through possible when buffers are empty
– Decreased throughput due to internal blocking

• Alternatives: buffers that are external to the switch fabric
– Output buffers
– Input buffers


Output buffer architecture

• As soon as a packet arrives, it is transferred to the appropriate output buffer

• Assume a slotted system (cell switch)

• During each slot the switch fabric transfers one packet from each input (if available) to the appropriate output
– Must be able to transfer N packets per slot
– Bus speed must be N times the line rate
– No queueing at the inputs
  Buffer at most one packet at the input for one slot


Queueing Analysis

• If external arrivals to each input are Poisson (average rate λ), each output queue behaves as an M/D/1 queue
– Packet duration equals one slot

• The average number of packets at each output is given by the M/G/1 formula:

average number = λ E[X] + λ^2 E[X^2] / (2(1 - λ E[X])),   with E[X] = E[X^2] = 1

• Note that the only delay is due to the queueing at the outputs and none is due to the switch fabric

Advantages/Disadvantages of output buffer architecture

• Advantages: no delay or blocking inside the switch

• Disadvantages:
– Bus speed must be N times line speed
  Imposes a practical limit on the size and capacity of the switch

• Shared output buffers: output buffers are implemented in shared memory using a linked list
– Requires less memory (due to statistical multiplexing)
– Memory must be fast


Input buffer

• Packets are buffered at the input rather than the output, so the switch fabric does not need to be as fast

• During each slot, the scheduler establishes the crossbar connections to transfer packets from the inputs to the outputs
  – Maximum of one packet from each input
  – Maximum of one packet to each output


Throughput analysis of input queued switches

• Head of line (HOL) blocking – when the packets at the head of two or more input queues are destined to the same output, only one can be transferred and the others are blocked

• HOL blocking limits throughput because some inputs (and consequently outputs) are kept idle during a slot even when they have other packets to send in their queue

• Consider an NxN switch and again assume that inputs are saturated (always have a packet to send)

• Uniform traffic => each packet is destined to each output with equal probability (1/N)

• Now, consider only those packets at the head of their queues (there are N of them!)


Throughput analysis, continued

• Let $Q_m^i$ be the number of HOL packets destined to node i at the end of the mth slot:

$$Q_m^i = \max\left(0,\; Q_{m-1}^i + A_m^i - 1\right)$$

• Where $A_m^i$ = number of new HOL messages addressed to node i that arrive to the HOL during slot m. Now,

$$P(A_m^i = l) = \binom{C_{m-1}}{l}\left(\frac{1}{N}\right)^{l}\left(1 - \frac{1}{N}\right)^{C_{m-1} - l}$$

• Where $C_{m-1}$ = number of HOL messages that departed during slot m-1 = number of new HOL arrivals

• As N approaches infinity, $A_m^i$ becomes Poisson of rate C/N where C is the average number of departures per slot
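The HOL dynamics described above can be checked with a small Monte Carlo simulation. This is a sketch under stated assumptions: inputs are saturated, destinations are uniform, each output serves one contending HOL packet chosen at random, and the winning input immediately exposes a fresh HOL packet. The function name `hol_throughput`, the random tie-break and the seed are illustrative choices, not part of the slides.

```python
import random


def hol_throughput(n_ports, n_slots, seed=0):
    """Estimate per-port throughput of a saturated FCFS input-queued switch."""
    rng = random.Random(seed)
    # hol[i] is the output that the head-of-line packet at input i wants.
    hol = [rng.randrange(n_ports) for _ in range(n_ports)]
    delivered = 0
    for _ in range(n_slots):
        # Group inputs by the output their HOL packet is destined to.
        contenders = {}
        for inp, out in enumerate(hol):
            contenders.setdefault(out, []).append(inp)
        # Each requested output serves exactly one contender; the rest stay blocked.
        for inputs in contenders.values():
            winner = rng.choice(inputs)
            hol[winner] = rng.randrange(n_ports)  # next packet moves to the HOL
            delivered += 1
    return delivered / (n_ports * n_slots)


if __name__ == "__main__":
    for n in (2, 8, 32, 128):
        print(n, round(hol_throughput(n, 2000), 3))
```

For small N the estimate is noticeably higher than for large N; as N grows it settles near the large-N limit obtained from the queueing analysis.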



• In steady-state, $Q^i$ behaves as an M/D/1 queue of rate $A$ and, as before,

$$\bar{Q}^i = \frac{2A - A^2}{2(1 - A)}$$

• Notice however that the total number of packets addressed to the outputs is N (number of HOL packets). Hence,

$$\sum_{i=1}^{N} Q^i = N \;\Rightarrow\; \bar{Q}^i = 1$$

• We can now solve, using the quadratic equation, to obtain:

$$A = \text{utilization} = 2 - \sqrt{2} \approx 0.58$$
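The last step can be verified numerically. Setting the M/D/1 mean occupancy $(2A - A^2)/(2(1 - A))$ equal to 1 gives $A^2 - 4A + 2 = 0$, whose root in [0, 1) is $2 - \sqrt{2}$. The short check below is a sketch; the helper name `mean_hol_backlog` is hypothetical.

```python
import math


def mean_hol_backlog(a):
    """M/D/1 mean occupancy (2a - a^2) / (2(1 - a)) for utilization a < 1."""
    return (2.0 * a - a * a) / (2.0 * (1.0 - a))


# Roots of a^2 - 4a + 2 = 0 are 2 +/- sqrt(2); only 2 - sqrt(2) is below 1.
saturation = 2.0 - math.sqrt(2.0)

if __name__ == "__main__":
    print(round(saturation, 4), round(mean_hol_backlog(saturation), 6))
```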


Summary of input queued switches

• The maximum throughput of an input queued switch is limited by HOL blocking to 58% (for large N)
  – Assuming uniform traffic and FCFS service

• Advantages of input queues:
  – Simple
  – Bus rate = line rate

• Disadvantages: Throughput limitation


Overcoming HOL blocking

• If inputs are allowed to transfer packets that are not at the head of their queues, throughput can be substantially improved (not FCFS)

Example:

How does the scheduler decide which input to transfer to which output?


Backlog matrix

• Each entry in the backlog matrix represents the number of packets in input i’s queue that are destined to output j

• During each slot the scheduler can transfer at most one packet from each input and at most one packet to each output
  – The scheduler must choose one packet (at most) from each row and column of the backlog matrix
  – This can be done by solving a bi-partite graph matching problem
  – The bi-partite graph consists of N nodes representing the inputs and N nodes representing the outputs


Bi-partite graph representation

• There is an edge in the graph from an input to an output if there is a packet in the backlog matrix from that input to that output

• For previous backlog matrix, the bi-partite graph is:

• A matching is a set of edges, such that no two edges share a node: a matching in the bi-partite graph is equivalent to a set of packets such that no two packets share a row or column in the backlog matrix

• A maximum matching is a matching with the maximum possible number of edges: a maximum matching is equivalent to the largest set of packets that can be transferred simultaneously
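One standard way to compute a maximum matching is the augmenting-path method (often called Kuhn's algorithm); the slides do not say which algorithm they have in mind, so this is only an illustrative sketch. The 3x3 backlog matrix is an assumed example whose edges are 1-1, 1-2, 2-1 and 3-3, and the function name is hypothetical. Indices in the code are 0-based.

```python
def maximum_matching(backlog):
    """Return {output: input} for a maximum matching, found via augmenting paths."""
    n = len(backlog)
    match_of_output = {}

    def try_assign(inp, visited):
        # Try to give `inp` an output, re-routing earlier choices if needed.
        for out in range(n):
            if backlog[inp][out] > 0 and out not in visited:
                visited.add(out)
                if out not in match_of_output or try_assign(match_of_output[out], visited):
                    match_of_output[out] = inp
                    return True
        return False

    for inp in range(n):
        try_assign(inp, set())
    return match_of_output


if __name__ == "__main__":
    backlog = [[3, 3, 0],   # assumed example: rows are inputs, columns are outputs
               [2, 0, 0],
               [0, 0, 2]]
    print(maximum_matching(backlog))
```

On this matrix the result pairs input 2 with output 1, input 1 with output 2 and input 3 with output 3 (in 1-based numbering), so all three inputs transfer a packet in the slot.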


Maximum matchings


  – Too long for large N

• Alternatives
  – Sub-optimal solutions
  – Maximal matching: A matching that cannot be made any larger for a given backlog matrix
  – For previous example:
      (1-1,3-3) is maximal
      (2-1,1-2,3-3) is maximum

• Fact: The number of edges in a maximal matching ≥ 1/2 the number of edges in a maximum matching
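A maximal matching can be built greedily in a single pass, which is why it is attractive when a maximum matching takes too long. The sketch below is one such greedy rule (scan inputs in order, take the first free output); the rule, the function name and the example matrix are assumptions made for illustration. Indices are 0-based.

```python
def greedy_maximal_matching(backlog):
    """Give each input, in order, the first still-free output it has packets for."""
    used_outputs = set()
    matching = []
    for inp, row in enumerate(backlog):
        for out, count in enumerate(row):
            if count > 0 and out not in used_outputs:
                used_outputs.add(out)
                matching.append((inp, out))
                break  # at most one packet per input per slot
    return matching


if __name__ == "__main__":
    backlog = [[3, 3, 0],   # assumed example with edges 1-1, 1-2, 2-1, 3-3
               [2, 0, 0],
               [0, 0, 2]]
    print(greedy_maximal_matching(backlog))
```

On the assumed matrix the greedy pass returns the pairs (1-1, 3-3) in 1-based numbering: input 2 is left idle because its only output is already taken, yet no edge can be added, so the matching is maximal with 2 edges against a maximum of 3.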


Achieving 100% throughput in an input queued switch

• Finding a maximum matching during each time slot does not eliminate the effects of HOL blocking
  – Must look beyond one slot at a time in making scheduling decisions

• Definition: A weighted bi-partite graph is a bi-partite graph with costs (weights) associated with the edges

• Definition: A maximum weighted matching is a matching with the maximum total edge weight

• Theorem: A scheduler that chooses during each time slot the maximum weighted matching, where the weight of link (i,j) is equal to the length of queue (i,j), achieves full utilization (100% throughput)
  – Proof: see “Achieving 100% throughput in an input queued switch” by N. McKeown et al., IEEE Transactions on Communications, Aug. 1999.
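For very small switches the maximum weighted matching can be found by brute force over all input-to-output permutations. This is only a sketch to make the definition concrete: it costs N! evaluations, so it is not how a real scheduler would do it, and the function name and example queue lengths are assumptions. Indices are 0-based.

```python
from itertools import permutations


def max_weight_matching(queue_lengths):
    """Return (total weight, matched (input, output) pairs) by exhaustive search."""
    n = len(queue_lengths)
    best_weight, best_perm = -1, None
    for perm in permutations(range(n)):
        # perm[i] is the output assigned to input i.
        weight = sum(queue_lengths[i][perm[i]] for i in range(n))
        if weight > best_weight:
            best_weight, best_perm = weight, perm
    # Drop pairs whose queue is empty: nothing is transferred on them.
    pairs = [(i, j) for i, j in enumerate(best_perm) if queue_lengths[i][j] > 0]
    return best_weight, pairs


if __name__ == "__main__":
    queues = [[3, 3, 0],   # assumed example: queue (i, j) length at row i, column j
              [2, 0, 0],
              [0, 0, 2]]
    print(max_weight_matching(queues))
```

Using queue lengths as weights makes the scheduler favor long queues, which is the mechanism the theorem relies on.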


General relation with bipartite matching

• Stability of infinite input-buffered switch iff we can decompose the traffic as a convex linear combination of 0,1 sub-stochastic matrices

• Birkhoff-von Neumann principle

• This links packets and flows to circuits

• Corollary: if we know the traffic matrix well, then we can provide stable service through a TDM schedule

• Delay effects?

• Robustness to poor knowledge of the traffic?
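As a rough illustration of the corollary, the sketch below checks that a traffic matrix is doubly sub-stochastic (every row and column sum at most 1) and expands an already known decomposition into a repeating TDM frame. Both helpers, their names, and the rounding of slot counts are assumptions made for the example; computing the decomposition itself is not shown.

```python
def is_admissible(rates, tol=1e-9):
    """True if every row sum and every column sum of the rate matrix is at most 1."""
    n = len(rates)
    rows_ok = all(sum(row) <= 1.0 + tol for row in rates)
    cols_ok = all(sum(rates[i][j] for i in range(n)) <= 1.0 + tol for j in range(n))
    return rows_ok and cols_ok


def tdm_frame(decomposition, frame_length):
    """Expand [(weight, permutation), ...] into one crossbar setting per slot.

    permutation[i] is the output connected to input i; each permutation gets
    round(weight * frame_length) slots in the frame.
    """
    frame = []
    for weight, perm in decomposition:
        frame.extend([perm] * round(weight * frame_length))
    return frame


if __name__ == "__main__":
    rates = [[0.5, 0.5],
             [0.5, 0.5]]
    print(is_admissible(rates))
    # rates = 0.5 * (straight-through) + 0.5 * (crossed)
    print(tdm_frame([(0.5, (0, 1)), (0.5, (1, 0))], frame_length=4))
```

A fixed frame like this serves each input-output pair at its long-run rate, which leaves open the delay and robustness questions raised in the bullets.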


LANs

• The driver behind LANs can be roughly thought of as increasing the reach and sharing of a bus

• Traditional Ethernet: CSMA/CD, shared

• Other approach: token ring, for instance Fiber Distributed Data Interface (FDDI)

• Switched networks: lines are not shared but go through a router/switch

[Figure labels: User1, User1, Shared ring]


IEEE/ANSI 802 standards

• 802.1: bridging
• 802.2: logical link control
• 802.3: CSMA/CD (Ethernet)
• 802.4: Token bus
• 802.5: Token Ring
• 802.6: DQDB MAN (distributed queue dual bus)
• 802.9: IIS LAN (integrated services)
• 802.11: Wireless LAN
• 802.12: DPAM (demand priority access method)

Each of 802.3 through 802.12 has both a medium access and a physical standard


Evolution of Ethernet

• Ethernet emerged from the ideas of shared media such as ALOHA, and the first Ethernet was built at Xerox PARC in the early 1970s

• Ethernet is not completely 802.3, but a close approximation (there are some differences in the packet)

• Ethernet node:

• MAC enforces CSMA/CD and performs:
  – Transmit and receive message data encapsulation:
    • Framing
    • Addressing
    • Error detection
  – Media access management:
    • Medium allocation (collision avoidance)
    • Contention resolution (collision handling)

• PLS: physical signaling, Manchester encoding

• AUI (attachment unit interface) manages data in (DI), data out (DO) and control in (CI)

• Medium attachment unit (MAU): transmits and receives data, loops data back from DO to DI to indicate valid Tx and Rx path, detects collisions, sends signal quality error signal, performs jabber function, checks link integrity

[Figure: Ethernet node. Labels: host system bus, system interface, MAC, PLS, DI/DO/CI signals, MAU, RG 58 coax]
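The Manchester encoding performed by the PLS can be sketched in a few lines. The example assumes the IEEE 802.3 polarity convention (a 0 is sent as high then low, a 1 as low then high) and represents each half-bit as 0 or 1; the function names are hypothetical.

```python
def manchester_encode(bits):
    """Map each bit to two half-bit levels: 0 -> (1, 0), 1 -> (0, 1)."""
    halves = []
    for bit in bits:
        halves.extend((0, 1) if bit else (1, 0))
    return halves


def manchester_decode(halves):
    """Inverse of manchester_encode; assumes an even-length, valid sequence."""
    pairs = zip(halves[0::2], halves[1::2])
    return [1 if pair == (0, 1) else 0 for pair in pairs]


if __name__ == "__main__":
    data = [1, 0, 1, 1, 0]
    line = manchester_encode(data)
    print(line)
    print(manchester_decode(line) == data)
```

Every bit produces a mid-bit transition, which is what lets the receiver recover the clock, at the price of doubling the signaling rate.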


Increasing Ethernet bandwidth – the first step

• The first Ethernet went up to 10 Mbps: 10BASE-T, over phone-grade twisted pair, with a repeater in the middle of a star configuration acting as a virtual shared medium (traditional 10BASE5 and Cheapernet 10BASE2, on thick and thin coax respectively, were also laid out)

• 10 Mbps Ethernet over fiber (10BASE-F) was developed, extending the distance between MAUs to 2 km instead of 500 m in coax

• 1990: the Etherswitch was marketed by Kalpana to boost LAN performance rather than as a bridge to interconnect different LANs, and in 1993 full-duplex interconnect was also introduced by Kalpana

• Still, each port could only deliver 10 Mbps; the option for a higher-speed (100 Mbps) connection was FDDI, which was expensive


Fast Ethernet

• In 1992, Grand Junction introduced 100 Mbps Ethernet

• Standardization was done by the Fast Ethernet Alliance, while the IEEE struggled between 802.3 and a demand-priority camp, which created the 802.12 group

• Later 802.3u standardized 100BASE-T

• Main differences between 10BASE-T and 100BASE-T:
  – No more mixing segments (coax with multiple devices attached); all cabling is point to point between terminal equipment or repeaters
  – Shorter distances: 100 m for Cat 5, Cat 3 and 130 m for fiber (160 m if all-fiber network)
  – Kept the MAC but changed elements below to adapt to 100 Mbps: replaced the AUI with the media independent interface (MII), added a reconciliation sublayer (going from bit-serial to nibble-serial), went from Manchester encoding to NRZ

• 10 GigE is emerging as a new standard http://www.10gea.org/Tech-whitepapers.htm


10 Gigabit Ethernet

• 10 GigE is emerging as a new standard

• The standard is being developed with SONET interoperability in mind, with a view towards expansion in the MAN and WAN end-to-end Ethernet arena

• In particular, the load will be matched to OC-192 loads

• Task force 802.3ae is in charge of developing the 10 GE standard

• Also the 10 Gigabit Ethernet Alliance http://www.10gea.org/


Evolution to switched LANs

• VLANs were introduced to allow for smaller broadcast groups:
  – The standardization efforts have not yet yielded interoperable VLANs; they are still proprietary solutions
  – VLANs require a frame extension (802.3ac) to convey VLAN information via tagging (802.1Q) (a tag made of two 16-bit fields), approved in 1998

• Layer 3 switches implement some routing in hardware:
  – Routers were generally used for interconnecting LANs and for remote WAN connections
  – Switches traditionally had little intelligence but were very fast
  – Layer 3 switches still perform layer 2 switching but also some routing functionality in ASICs
  – They also implement VLANs
  – Generally support only IP
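The 802.1Q tag mentioned in the VLAN bullets is 32 bits long: a 16-bit Tag Protocol Identifier (0x8100) followed by a 16-bit Tag Control Information field holding a 3-bit priority, a 1-bit drop-eligible/canonical-format flag, and a 12-bit VLAN ID. The sketch below packs and parses that tag; the function names and the dictionary layout are illustrative assumptions.

```python
import struct

TPID_8021Q = 0x8100  # Tag Protocol Identifier value used by 802.1Q


def build_vlan_tag(vlan_id, priority=0, dei=0):
    """Return the 4-byte tag: 16-bit TPID followed by 16-bit TCI."""
    if not 0 <= vlan_id < 4096:
        raise ValueError("VLAN ID must fit in 12 bits")
    if not 0 <= priority < 8 or dei not in (0, 1):
        raise ValueError("priority is 3 bits, dei is 1 bit")
    tci = (priority << 13) | (dei << 12) | vlan_id
    return struct.pack("!HH", TPID_8021Q, tci)  # network byte order


def parse_vlan_tag(tag):
    """Split a 4-byte tag back into its fields."""
    tpid, tci = struct.unpack("!HH", tag)
    return {"tpid": tpid, "priority": tci >> 13,
            "dei": (tci >> 12) & 1, "vlan_id": tci & 0x0FFF}


if __name__ == "__main__":
    tag = build_vlan_tag(100, priority=5)
    print(tag.hex(), parse_vlan_tag(tag))
```

The 12-bit VLAN ID is what bounds the number of distinct broadcast groups a single tag can name.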


The next step in Ethernet- Gigabit Ethernet

• The Gigabit Ethernet Alliance (May 1996) started the push for Gigabit Ethernet, mostly standardized as 802.3z in 1998

• Main characteristics:
  – The MAC itself was modified so that there is a 200 m network span with a single repeater
  – The MII was changed to GMII, with Tx and Rx data paths widened to 8 bits
  – Adoption of 8-bit/10-bit Fibre Channel encoding
  – Carrier extension: extending or padding from the 64-byte minimum to a 512-byte minimum to maintain compatibility
  – Frame bursting to enhance efficiency: worst-case efficiency is compared for CSMA/CD at 100 Mb/s and at 1000 Mb/s (formula terms: minimum packet length, preamble length, inter-frame gap)


Frame-bursting for Efficiency

• Frame bursting to enhance efficiency

• Worst-case efficiency for 100 Mb/s CSMA/CD is: [equation missing in source]

• For 1000 Mb/s with CSMA/CD it is: [equation missing in source]

• If we allow n frames to be transmitted in a burst after the first frame, then worst-case efficiency is: [equation missing in source]

• Efficiency gains beyond a burst length of 65,536 bits are minimal; efficiency is about 72% at that value

[Terms labeled in the equations: minimum packet length, preamble length, inter-frame gap, slot time]
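Since the slide equations did not survive, the following is a reconstruction under stated assumptions rather than the author's formulas. It takes efficiency as useful minimum-frame bits divided by total channel time, using standard Ethernet figures (512-bit minimum frame, 64-bit preamble, 96-bit inter-frame gap, 4096-bit slot time for Gigabit carrier extension, 65,536-bit burst limit). The model and the function name are assumptions.

```python
MIN_FRAME = 512       # minimum packet length, bits (64 bytes)
PREAMBLE = 64         # preamble length, bits
IFG = 96              # inter-frame gap, bit times
GIGABIT_SLOT = 4096   # slot time with carrier extension, bits (512 bytes)
BURST_LIMIT = 65536   # burst length limit, bits


def worst_case_efficiency(first_frame_time, extra_frames=0):
    """Useful bits / channel time when every frame has minimum length.

    first_frame_time: bit times occupied by the first frame (MIN_FRAME at
    100 Mb/s, GIGABIT_SLOT at 1000 Mb/s because of carrier extension).
    extra_frames: further minimum-size frames sent in the same burst.
    """
    overhead = PREAMBLE + IFG
    useful = (1 + extra_frames) * MIN_FRAME
    total = first_frame_time + overhead + extra_frames * (MIN_FRAME + overhead)
    return useful / total


if __name__ == "__main__":
    print(round(worst_case_efficiency(MIN_FRAME), 2))       # 100 Mb/s
    print(round(worst_case_efficiency(GIGABIT_SLOT), 2))    # 1000 Mb/s, no bursting
    n = (BURST_LIMIT - GIGABIT_SLOT - PREAMBLE - IFG) // (MIN_FRAME + PREAMBLE + IFG)
    print(n, round(worst_case_efficiency(GIGABIT_SLOT, n), 2))
```

Under these assumptions a full burst comes out at about 0.72, in line with the 72% figure quoted for a 65,536-bit burst, which suggests the model is close to the one on the slide.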


Another LAN application: storage access

• In the open systems world, the dominant I/O technology is the small computer system interface (SCSI), which transfers data in blocks; standardized in 1986 as ANSI X3T9

• SCSI drawbacks:
  – Two or more I/O controllers cannot easily share SCSI devices on the same I/O bus, so a single server controls connections between users and their data
  – Addresses on an I/O bus: 8 or 16 depending on implementation
  – Distance limited to 25 m

[Figure labels: storage devices, SCSI channels, server]


A new type of LAN – the SAN

• In the same way that early LANs developed from extending the bus, the requirement for more storage has driven extending the SCSI interface to many devices and eventually replacing a single storage device with a full network, the storage area network (SAN)

• Based on the Fibre Channel (FC) protocol:
  – Gigabit per second bandwidth (1063 Mbps) and theoretically up to 4 Gbps
  – Allows SCSI in serial form rather than the parallel form usually found in SCSI (also supports HIPPI and IPI I/O protocols)
  – Distance of up to 10 km
  – 24-bit address identifier: up to 16 million ports


FC

• Upper level protocols include applications, device drivers, operating systems

• Common services are striping, hunt groups, multicast

• Framing: frames of up to 2112 bytes, sequences (one or more frames), exchanges (uni- or bidirectional set of non-concurrent sequences), packets (one or more exchanges)

[Figure: FC protocol stack, top to bottom: Upper level protocols; FC4 Protocol mappings; FC3 Common services; FC2 Framing protocol; FC1 Encode/decode; FC0 Physical. Side labels: Port Level, Node Level]


Different types of FC SAN architectures

• Point-to-point

• Arbitrated loop topology:
  – Up to 126 devices in a serial loop configuration
  – Each port discovers when it has been attached
  – No collisions
  – Fair access: every port wanting to initiate traffic gets to do so before another port gets a second shot

[Figure label: hub]


Different types of FC SAN architectures

• Fabric topology

• A common fabric topology is cascaded switches

FC switch

Host I/O controller


This is not a shared bus!

Commerzbank Brocade set-up


Other alternatives to SANs

• Embedded disk drives

• Directly attached storage: attached by SCSI directly, possibly shared among servers

• Network attached storage is in front of the server, directly attached to the network, rather than behind the server as a SAN
  – Protocol is generally NFS vs. FC for SAN
  – Network is Ethernet vs. FC for SAN
  – Source and target are client/server or server/server vs. server/device for SAN
  – Transfers files vs. device blocks for SAN
  – Connection is direct on network vs. I/O bus or channel on server for SAN
  – Has an embedded file system


High availability in the enterprise

[Figure: Tx/Rx port pairs attached over primary and secondary links (GigE or FC) to a primary switch and a secondary switch, with an inter-switch connection]


MANs

• MANs are a fuzzy area since they may operate as large LANs or simply as the last leg of a WAN

• Certain protocols are particularly oriented towards MANs, such as DQDB, a dual bus that is either folded or not folded:
  – Exhibited certain issues with utilization fairness
  – Not very flexible in its layout architecture

[Figure: dual bus (a head end at each end, nodes in between) and folded bus (a single head end, nodes along the bus)]


Resilient Packet Ring

• Rings for packet access in the MAN

• Resilient packet ring alliance (RPR) and IEEE working group 802.17 (started December 2000)

• Oriented towards IP

• Recovery is done using traditional self-healing ring approach

• Maintains the same architecture as SONET rings and FDDI, but changes the MAC


WANs

• WANs are predominantly implemented over optical networks

• The underlying protocol is SONET (synchronous optical network), or SDH (synchronous digital hierarchy) in Europe and Japan

• Synchronous, so framing is in terms of timing

• Lowest-speed SONET runs at STS-1, 51.84 Mbps

• STS frames may be concatenated with a single header, which contains pointers to the different headers of the STS frames

• SONET provides very tight requirements on reliability

• Typical implementations are UPSR or BLSR

• Recovery must occur within 50 ms; detection of a problem occurs within 2.4 microseconds


WANs

• WANs are increasingly dense and require extensive network management

• Provisioning across WANs in a short time is a growing need as the reselling market becomes more fluid

• WANs are increasingly called upon to perform functions heretofore reserved for LANs or MANs, so there is increasing convergence

• Speed per wavelength is now OC-48 (2.5 Gbps) or OC-192 (10 Gbps), possibly going towards 40 Gbps
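The rounded line rates quoted above follow from the SONET hierarchy, in which OC-n runs at n times the STS-1 rate of 51.84 Mbps. A minimal sketch, with a hypothetical helper name:

```python
STS1_MBPS = 51.84  # lowest-speed SONET signal


def oc_rate_mbps(n):
    """Line rate of OC-n in Mbps: n multiples of the STS-1 rate."""
    return n * STS1_MBPS


if __name__ == "__main__":
    for n in (1, 3, 12, 48, 192, 768):
        print(f"OC-{n}: {oc_rate_mbps(n):.2f} Mbps")
```

OC-48 and OC-192 come out just under 2.5 Gbps and 10 Gbps, and OC-768 is the step that corresponds to the 40 Gbps figure.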


Access to the Optical Infrastructure

• Two trends in optical access:
  – IP, GE being pushed closer to the core
  – Streaming media pushing core-type traffic closer to the edge

• How should access be architected:
  – Role of network management
  – Types of nodes

[Figure: network hierarchy. Core: SONET, x on WDM. MAN: SONET, ATM, x on WDM. Local: GE, FC, ATM, TCP/IP. Access: MPLS or other encapsulation]


Fast packet switching

Eytan Modiano, Massachusetts Institute of Technology


Packet switches

• A packet switch consists of a routing engine (table look-up), a switch scheduler, and a switch fabric.

• The routing engine looks up the packet address in a routing table and determines which output port to send the packet to.
  – Packet is tagged with port number
  – The switch uses the tag to send the packet to the proper output port
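The look-up-then-tag step can be sketched as follows. The slide does not say how the table is organized; this example assumes an IP-style table searched by longest prefix match, and the table contents, port numbers and function names are all hypothetical.

```python
import ipaddress

# Hypothetical routing table: (destination prefix, output port).
ROUTING_TABLE = [
    (ipaddress.ip_network("10.0.0.0/8"), 1),
    (ipaddress.ip_network("10.1.0.0/16"), 2),
    (ipaddress.ip_network("0.0.0.0/0"), 0),   # default route
]


def lookup_port(dest):
    """Return the output port of the longest prefix that contains dest."""
    addr = ipaddress.ip_address(dest)
    matches = [(net.prefixlen, port) for net, port in ROUTING_TABLE if addr in net]
    return max(matches)[1]  # the default route guarantees at least one match


def tag_packet(packet):
    """Attach the output port so the fabric can forward without another look-up."""
    return {**packet, "out_port": lookup_port(packet["dest"])}


if __name__ == "__main__":
    print(tag_packet({"dest": "10.1.2.3", "payload": b"hello"}))
```

Tagging once at the routing engine keeps the scheduler and fabric simple: they only need to read the port number.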


First Generation Switches

• Computer with multiple line cards
  – CPU polls the line cards
  – CPU processes the packets

• Simple, but performance is limited by processor speeds and bus speeds

• Examples: Ethernet bridges and low end routers


Second Generation switches

• Most of the processing is now done in the line cards
  – Route table look-up, etc.
  – Line cards buffer the packets
  – Line cards send packets to the proper output port

• Advantages: CPU and main memory are no longer the bottleneck

• Disadvantage: Performance limited by bus speeds
  – Bus BW must be N times LC speed (N ports)

• Example: Cisco 7500 series router


Third generation switches

• Replace shared bus with a switch fabric

• Performance depends on the switch fabric, but potentially can alleviate the bus bottleneck

[Figure: input line cards (LC) connected through an N by N switch fabric to output line cards, with a controller]


Input buffer architecture

• Packets are buffered at the input rather than the output
  – Switch fabric does not need to be as fast

• During each slot, the scheduler establishes the crossbar connections to transfer packets from the inputs to the outputs
  – Maximum of one packet from each input
  – Maximum of one packet to each output

• Head of line (HOL) blocking – when the packets at the head of two or more input queues are destined to the same output, only one can be transferred and the others are blocked


Throughput analysis of input queued switches

• HOL blocking limits throughput because some inputs (and consequently outputs) are kept idle during a slot even when they have other packets to send in their queue

• Consider an NxN switch and again assume that inputs are saturated (always have a packet to send)

• Uniform traffic => each packet is destined to each output with equal probability (1/N)

• Now, consider only those packets at the head of their queues (there are N of them!)



• Let $Q_m^i$ be the number of HOL packets destined to node i at the end of the mth slot:

$$Q_m^i = \max\left(0,\; Q_{m-1}^i + A_m^i - 1\right)$$

• Where $A_m^i$ = number of new HOL messages addressed to node i that arrive to the HOL during slot m. Now,

$$P(A_m^i = l) = \binom{C_{m-1}}{l}\left(\frac{1}{N}\right)^{l}\left(1 - \frac{1}{N}\right)^{C_{m-1} - l}$$

• Where $C_{m-1}$ = number of HOL messages that departed during slot m-1 = number of new HOL arrivals

• As N approaches infinity, $A_m^i$ becomes Poisson of rate C/N where C is the average number of departures per slot
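The Poisson limit can be checked by comparing the binomial arrival distribution with a Poisson distribution of the same mean. In this sketch the number of departures is held fixed at its mean, a fraction `utilization` of the N ports; that simplification, the default cut-off `k_max`, and the function names are assumptions made for illustration.

```python
import math


def binomial_pmf(k, trials, p):
    """P(k successes in `trials` independent tries, each with probability p)."""
    return math.comb(trials, k) * p ** k * (1 - p) ** (trials - k)


def poisson_pmf(k, rate):
    """P(k arrivals) for a Poisson distribution with the given mean."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def max_pmf_gap(n_ports, utilization, k_max=10):
    """Largest |binomial - Poisson| over k = 0..k_max for an N-port switch."""
    departures = round(utilization * n_ports)   # stands in for C_{m-1}
    rate = departures / n_ports                 # the C/N rate of the limit
    return max(abs(binomial_pmf(k, departures, 1 / n_ports) - poisson_pmf(k, rate))
               for k in range(k_max + 1))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, max_pmf_gap(n, 0.58))
```

The gap shrinks as N grows, which is the sense in which the per-output HOL arrivals become Poisson.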



• In steady-state, $Q^i$ behaves as an M/D/1 queue of rate $A$ and,

$$\bar{Q}^i = \frac{2A - A^2}{2(1 - A)}$$

• Notice however that the total number of packets addressed to the outputs is N (number of HOL packets). Hence,

$$\sum_{i=1}^{N} Q^i = N \;\Rightarrow\; \bar{Q}^i = \frac{2A - A^2}{2(1 - A)} = 1$$

• We can now solve, using the quadratic equation, to obtain:

$$A = \text{utilization} = 2 - \sqrt{2} \approx 0.58$$


Summary of input queued switches

• The maximum throughput of an input queued switch is limited by HOL blocking to 58% (for large N)
  – Assuming uniform traffic and FCFS service

• Advantages of input queues:
  – Simple
  – Bus rate = line rate

• Disadvantages: Throughput limitation


Overcoming HOL blocking

• If inputs are allowed to transfer packets that are not at the head of their queues, throughput can be substantially improved (not FCFS)

Example:

• How does the scheduler decide which input to transfer to which output?


Backlog matrix

• Each entry in the backlog matrix represents the number of packets in input i’s queue that are destined to output j

• During each slot the scheduler can transfer at most one packet from each input and at most one packet to each output
  – The scheduler must choose one packet (at most) from each row and column of the backlog matrix
  – This can be done by solving a bi-partite graph matching problem
  – The bi-partite graph consists of N nodes representing the inputs and N nodes representing the outputs

Backlog matrix (rows: input 1 to 3; columns: output 1 to 3):

  input 1:  3  3  0
  input 2:  2  0  0
  input 3:  0  0  2
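Turning a backlog matrix into the edge set of the bi-partite graph is a one-line scan. The sketch below uses the slide's 3x3 values arranged so that the non-zero entries are 1-1, 1-2, 2-1 and 3-3; that arrangement is an assumption inferred from the matchings discussed later, and the function names are hypothetical. Edges are reported with 1-based port numbers.

```python
def bipartite_edges(backlog):
    """Return (input, output) pairs, 1-based, for every non-empty queue."""
    return [(i + 1, j + 1)
            for i, row in enumerate(backlog)
            for j, count in enumerate(row)
            if count > 0]


def is_matching(edges):
    """True if no input and no output appears in more than one edge."""
    inputs = [i for i, _ in edges]
    outputs = [j for _, j in edges]
    return len(set(inputs)) == len(inputs) and len(set(outputs)) == len(outputs)


if __name__ == "__main__":
    backlog = [[3, 3, 0],   # assumed arrangement of the slide's values
               [2, 0, 0],
               [0, 0, 2]]
    print(bipartite_edges(backlog))
    print(is_matching([(2, 1), (1, 2), (3, 3)]), is_matching([(1, 1), (2, 1)]))
```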


Bi-partite graph representation

• There is an edge in the graph from an input to an output if there is a packet in the backlog matrix to be transferred from that input to that output
  – For previous backlog matrix, the bi-partite graph is:

• Definition: A matching is a set of edges, such that no two edges share a node
  – Finding a matching in the bi-partite graph is equivalent to finding a set of packets such that no two packets share a row or column in the backlog matrix

• Definition: A maximum matching is a matching with the maximum possible number of edges
  – Finding a maximum matching is equivalent to finding the largest set of packets that can be transferred simultaneously


Maximum Matchings


– Too long for large N

• Alternatives
  – Sub-optimal solutions
  – Maximal matching: A matching that cannot be made any larger for a given backlog matrix
  – For previous example:
      (1-1,3-3) is maximal
      (2-1,1-2,3-3) is maximum

• Fact: The number of edges in a maximal matching ≥ 1/2 the number of edges in a maximum matching


Achieving 100% throughput in an input queued switch

• Finding a maximum matching during each time slot does not eliminate the effects of HOL blocking
  – Must look beyond one slot at a time in making scheduling decisions

• Definition: A weighted bi-partite graph is a bi-partite graph with costs (weights) associated with the edges

• Definition: A maximum weighted matching is a matching with the maximum total edge weight

• Theorem: A scheduler that chooses during each time slot the maximum weighted matching, where the weight of link (i,j) is equal to the length of queue (i,j), achieves full utilization (100% throughput)
  – Proof: see “Achieving 100% throughput in an input queued switch” by N. McKeown et al., IEEE Transactions on Communications, Aug. 1999.