® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

®

Line Rate Packet Classification and Scheduling

Line Rate Packet Classification and Scheduling

Michael Kounavis (Intel)Michael Kounavis (Intel)

Alok Kumar (Intel)Alok Kumar (Intel)

Raj Yavatkar (Intel)Raj Yavatkar (Intel)

Harrick Vin (U. Texas, Austin)Harrick Vin (U. Texas, Austin)

October 26, 2005October 26, 2005

®

Packet ClassificationPacket Classification

• 3 • Communications TechnologyCommunications TechnologyLabLab

Tutorial SummaryTutorial Summary PART I: PART I:

Understanding the problemUnderstanding the problem

PART II: PART II: State-of-the-artState-of-the-art

PART III: PART III: Observations on real world classifiersObservations on real world classifiers

PART IV: PART IV: Two stage packet classification using Most Two stage packet classification using Most

Specific Filter Matching and Transport Level Specific Filter Matching and Transport Level Sharing Sharing

®

PART I: Understanding the ProblemPART I: Understanding the Problem


Problem StatementProblem Statement Packet classifiersPacket classifiers

Lists of rulesLists of rules

Rules: <priority, predicate, action> tripletsRules: <priority, predicate, action> triplets

Single Match ProblemSingle Match Problem Find the highest priority rule that matches with Find the highest priority rule that matches with

a packeta packet

Multiple Match ProblemMultiple Match Problem Find all rules that match with a packetFind all rules that match with a packet


A Rule DatabaseA Rule Database

PERMIT* 147.101.* 1040-1070 *

DENY* 128.151.* * 2110-2150

DENY132.* 153.* * ftp

Src. IPSrc. IP Dest. IPDest. IPSrc. Src. PortPort

Dst. Dst. PortPort ActionAction

PredicatePredicate

Priority LevelPriority Level

1

2

3

a rule


TCP

A RuleA Rule

PERMIT* 147.101.* 2140-2170 *

Src. IP Dest. IPSrc. Port

Dst. Port Action

Predicate

Protocol

port ranges: arbitraryexamples:eq. 1050gt. 1024range 3000-4000

IP address ranges: IP prefixesexamples:128.59.67.0/255.255.255.0128.59.208.0/255.255.240.0128.59.0.0/255.255.0.0

Exact Value


An IP PrefixAn IP Prefix

A A range of values in a single dimension

128.67.0.0 128.67.255.2550.0.0.0

Src. IP address

range of values associated with the

Prefix 128.67.*


An Arbitrary RangeAn Arbitrary Range

A A range of values in a single dimension

2140 31400

Dst. Port

range of values between 2140 and 3140


An Exact ValueAn Exact Value

A A Specific Number

0

Protocol Field

1 6 17

TCP UDPICMP


A Source-Destination IP Prefix Pair A Source-Destination IP Prefix Pair

Src. IP

Dst. IP132.59.0.0

132.59.255.255

128.67.0.0

128.67.255.255

Dst. IP

Src. IP

132.59.64.10

128.67.0.0

128.67.255.255

Src. IP

Dst. IP132.59.64.10

128.67.208.1

pointfor (128.67.208.1,

132.59.64.10)

line segmentfor (128.67.*, 132.59.64.10)

rectanglefor (128.67.*,

132.59.*)


Relationship Between IP Prefix Pairs Relationship Between IP Prefix Pairs

Src. IP

128.67.0.0

128.67.255.255

128.67.32.0

128.67.32.255

Dst. IP

partially overlappingexample:

(128.67.*, 145.39.3.*) and (128.67.32.*, 145.39.*)

disjointexample:

(128.67.*, 121.45.5.*) and (128.67.32.*, 128.44.32.*)

fully overlappingexample:

(128.67.*, 167.7.*) and (128.67.32.*, 167.7.4.*)

121.45.5.0

121.45.5.255

128.44.32.0

128.44.32.255

145.39.0.0

145.39.3.0

145.39.3.255

145.39.255.255

167. 7.0.0

167. 7. 4.0

167.7.4.255

167. 7. 255.255


Partial Overlaps IP Prefix Pairs vs. Arbitrary RangesPartial Overlaps IP Prefix Pairs vs. Arbitrary Ranges

partially overlapping IP prefix pairs partially overlapping IP prefix pairs always form the shape of a crossalways form the shape of a cross

partially overlapping pairs ofpartially overlapping pairs ofarbitrary ranges may form any shapearbitrary ranges may form any shape


Packet Classification as a Point Location ProblemPacket Classification as a Point Location Problem

Given a point in a d-dimensional spacefind the highest priority hyper-cube that

contains the point

Rule 1

Rule 2

Rule 3

Rule 4

Rule 5

®

PART II: State-of-the-ArtPART II: State-of-the-Art


Packet Classification: An Open ProblemPacket Classification: An Open Problem

Mogul et. al.Packet Filter Concept

1987 1994

Chazele et. al.Point Location AmongHyperplanes

1998

Lakshman and StiliadisBit VectorSrinivasan, Suri, VargheseGrid of Tries, Cross Producting

1999

Gupta, McKeownRecursive Flow ClassificationSrinivasan, Suri, VargheseTuple-Space Search

Baboescu et. al.,Aggregate Bit VectorGupta, McKeownHiCutts

2000 2003

Baboescu et. al.,Extended Grid of TriesSingh et. alHyperCuttsKounavis et. al.Most Specific Filter Matching

2004

Taylor, TurnerDistributed Cross Producting


Multi-dimensional TriesMulti-dimensional Tries

10

0

01

1

Rule 1

Rule 2 Rule 4

Rule 3

Destination Trie

Source Tries

backtracking

Rule 1: (*, 1*)

Rule 2: (0*, 0*)

Rule 3: (1*, 00*)

Rule 4: (1*, 0*)

Rule Database

Packet: (1101, 0011)


Grid of TriesGrid of Tries

10

0

01

1

Rule 1

Rule 2 Rule 4

Rule 3

0

backtracking is avoidedusing switch pointers

Rule 1: (*, 1*)

Rule 2: (0*, 0*)

Rule 3: (1*, 00*)

Rule 4: (1*, 0*)

Rule Database

Packet: (1101, 0011)


Bit Vector SchemesBit Vector Schemes

Dimension 1 Dimension 2 Dimension k

LPM Search LPM Search LPM Search

Bit Vector 1 Bit Vector 2 Bit Vector k

Logical ‘AND’ and Priority Resolution


Cross ProductingCross Producting

Dimension 1 Dimension 2 Dimension k

LPM Search LPM Search LPM Search

Index 1 Index 2 Index k

Index concatenation and hash lookup


Tuple-space SearchTuple-space Search

Key idea: Number of different combinations of prefix lengths (called tuples) is small (called tuples) is small

Rule 1: (*, 1*)

Rule 2: (0*, 0*)

Rule 3: (1*, 0*)

Rule 4: (1*, 00*)

Rule Database

Tuple 1: (*, 1*) Rule 1

Tuple 2: (1*, 1*) Rules 2 and 3

Tuple 3: (1*, 11*) Rule 4

Tuple Space

Hash Table 1

hash

lookuppacketAND

tuple 1

Hash Table 2

hash

lookuppacketAND

tuple 2

Hash Table 3

hash

lookuppacketAND

tuple 3


Recursive Flow ClassificationRecursive Flow Classification

packet

index 1

index 2

index 3

index 4

index 5

index 6

action

first recursion second recursion third recursion


HiCutts and HyperCuttsHiCutts and HyperCuttspacket

HiCutts: Cuts 1 dimension at a time

Rule 1Rule 2

…Rule m

HyperCutts: Cuts multiple dimensions at a time

packet


TCAMTCAM

0 1 0 1 1 0

Memory Array

Priority Encoder

Memory Location

Action Memory

TCAM

RAM

Matches

Entries

field values in the packet header


ComparisonComparisonAlgorithm Worst Case Lookup

Time ComplexityWorst Case Storage

Complexity

Multidimensional Tries wd ndw

Grid of Tries wd-1 ndw

Bit Vector dw + dn/a dn2/a

Tuple Space Search n n

Cross Producting dw nd

Recursive Flow Classification

d nd

HiCutts d nd

TCAM 1 n

n: number of rules, d: number of fields, w: field size

®

PART III: Observations on Real World ClassifiersPART III: Observations on Real World Classifiers


What we ObservedWhat we Observed IP IP prefix pairs:prefix pairs:

create partial overlaps which are significantly create partial overlaps which are significantly fewer than the theoretical worst casefewer than the theoretical worst case

transport level fields:transport level fields: form sets which are being shared by many form sets which are being shared by many

different source-destination IP prefix pairsdifferent source-destination IP prefix pairs

sets usually contain a small number of entriessets usually contain a small number of entries


Toward Two Stage Packet ClassificationToward Two Stage Packet Classification

IP Prefix Pair m

IP Prefix Pair 3

IP Prefix Pair 2

IP Prefix Pair 1

…

Src. IP

Dst. IP

packetfirst stage finds themost specific filter

for the packet

Set #1

Set #2

Set #k

Second stage obtainsa list of pointers to

shared sets of transport level fields

HW accelerator

intersection


Observations on IP prefix PairsObservations on IP prefix Pairs

IP prefix pairs are of 2 typesIP prefix pairs are of 2 types partially-specified filters (i.e., (*,X) or (Y, *))partially-specified filters (i.e., (*,X) or (Y, *))

fully-specified filtersfully-specified filters

partially specified filters partially specified filters are a small fraction (< 25%) of IP prefix pairsare a small fraction (< 25%) of IP prefix pairs

most fully-specified filters (> 80%) are most fully-specified filters (> 80%) are represented by represented by segments of straight linessegments of straight lines

pointspoints


IP Prefix Pair OverlapsIP Prefix Pair Overlaps

n2/4 amount of overlaps

filter structure with O(n2) overlaps realistic filter structure

(*, *)

cluster #1

(*, X)

(Y, *)

cluster #2

cluster #m

realistic amount of overlaps ?


Visualizing ACLs with our FilterViewer ToolVisualizing ACLs with our FilterViewer Tool


Partial Filter Overlaps in the Realistic Filter StructurePartial Filter Overlaps in the Realistic Filter Structure

Breakdown of Overlaps

observed number of overlaps

theoretical worst case

overlaps formed by partially specified filters only

overlaps formed by

fully specified

filters only

overlaps formed

between partially and fully specified

filters

ACL1 4 90,525 100% 0% 0%

ACL2 2249 138,601 45% 4% 51%

ACL3 6138 1,260,078 88% 1% 11%

ACL4 852 12,246 100% 0% 0%


Why Few Filter OverlapsWhy Few Filter Overlaps Partially specified filters represent a small Partially specified filters represent a small

fraction of the total number of filters in fraction of the total number of filters in databasesdatabases

Fully specified filters create an insignificant Fully specified filters create an insignificant amount of overlapsamount of overlaps

There is a bounded number of important There is a bounded number of important servers per IP address domainservers per IP address domain


Observations on Transport Level FieldsObservations on Transport Level Fields

relative priority of fields is the same in different occurrences of the field set


Transport Level SharingTransport Level Sharing

number of rulesnumber of unique sets of transport

level fields

number of entries in unique sets of transport level

fields

ACL1 754 102 316

ACL2 607 35 68

ACL3 2399 186 437

ACL4 157 8 47


ImplicationsImplications Classification can be split into 2 stages:Classification can be split into 2 stages:

Stage 1: IP address fieldsStage 1: IP address fields

Stage 2: transport level fieldsStage 2: transport level fields

A design that returns the smallest filter A design that returns the smallest filter intersection is viable intersection is viable since the amount of overlaps between IP prefix pairs is since the amount of overlaps between IP prefix pairs is

smallsmall

Searching through transport level fields can be Searching through transport level fields can be accelerated by hardwareaccelerated by hardware

®

PART IV: Two Stage Packet Classification Using Most Specific Filter Matching and Transport Level Sharing

PART IV: Two Stage Packet Classification Using Most Specific Filter Matching and Transport Level Sharing


Cross Producting RevisitedCross Producting Revisited

source IP address

destination IP address

132.59.255.255

132.59.10.0

132.59.10.255

128.67.0.0

128.67.32.0

128.67.255.255

128.67.32.255

121.45.5.0

121.45.5.255

125.12.12.255

125.12.12.0

cross product =(128.67.32.*, 132.59.10.*)

132.59.0.0

(128.67.32.*, 121.45.5.*) packet =(128.67.32.5, 132.59.10.10)

most specific filter =(128.67.*, 132.59.*)

(125.12.12.*, 132.59.10.*)

Cross Producting may return a non existent filter. To address this issue Cross

Producting adds all possible filters returned from LPM searches into its lookup table


Definition of a Cross ProductDefinition of a Cross Product

A filter with: A filter with: a source IP prefix equal to the prefix of a filter a source IP prefix equal to the prefix of a filter

FF11 from a database; and from a database; and

a destination IP prefix equal to the prefix of a destination IP prefix equal to the prefix of another filter another filter FF22 from the same database from the same database

FF22 is not necessarily equal to is not necessarily equal to FF11


Improving Cross ProductingImproving Cross Producting Cross Producting is fast but…Cross Producting is fast but…

Memory explosionMemory explosion

For ACL3 there are 431 src. IP prefixes and 517 dst. For ACL3 there are 431 src. IP prefixes and 517 dst. Prefixes, hence 222,396 cross productsPrefixes, hence 222,396 cross products

SolutionSolution We can remove 70%-80% of the cross products with We can remove 70%-80% of the cross products with

little penalty to the performance of the classifierlittle penalty to the performance of the classifier Not covered cross productsNot covered cross products

Partially covered productsPartially covered products


Motivating ExampleMotivating Example

A (*, *)

B

C

E

R1

R 2

R3

R4

R5R 6 R 7

121.45.5.0

121.45.5.255

132.59.10.0 255.255.255.255

128.67.0.0

128.67.32.255

125.12.12.255

125.12.12.0

128.67.32.0

128.67.255.255

147.101.10.0

147.101.10.255

255.255.255.255

132.59.10.255

132.59.255.255

D

132.59.0.0

do we need to store all

cross products R1-R7?


Not Covered Cross ProductsNot Covered Cross Products

source IP address


132.59.10.0132.59.10.255

128.67.32.0

128.67.32.255

121.45.5.0121.45.5.255

125.12.12.255

125.12.12.0

(125.12.12.*, 132.59.10.*)

(128.67.32.*, 121.45.5.*)

not covered cross product =(125.12.12.*, 121.45.5.*)

‘not covered’ cross products are only covered by (*,*). Hence they can be removed from the lookup table.

If no match is found the algorithm returns (*, *)


Partially Covered Cross ProductsPartially Covered Cross Products

‘partially covered’ cross products are only covered by (X,*) or (*, Y). Hence they can be removed from the

lookup table. If no match is found, the algorithm checks a database of partially specified filters. If no

match is found again the algorithm returns (*,*)

128.67.32.255

125.12.12.255

source IP address


132.59.10.0132.59.10.255

128.67.32.0

121.45.5.0121.45.5.255

125.12.12.0

(128.67.32.*, 121.45.5.*)

(125.12.12.*, 132.59.10.*)

partially covered cross product =(125.12.12.*, 121.45.5.*)

255.255.255.255

(125.12.12.*,*)


Fully Covered Cross ProductsFully Covered Cross Products

those cross products which are neither ‘not those cross products which are neither ‘not covered’ nor ‘partially covered’covered’ nor ‘partially covered’ fully specified filtersfully specified filters

filter intersections that are fully specifiedfilter intersections that are fully specified

filters which are: filters which are: formed by combining the source and destination formed by combining the source and destination

IP prefixes of different IP prefix pairs; and IP prefixes of different IP prefix pairs; and

contained into fully-specified filters or fully-contained into fully-specified filters or fully-specified filter intersectionsspecified filter intersections

these are called ‘indicator filters’ these are called ‘indicator filters’


Most Specific Filter MatchingMost Specific Filter Matching

index 1 index 2

index 1 index 2

lookup on atable of

fully coveredcross products

index 1 index 2

index 1 index 2

primary table

index 3

secondary table of filters

of the form(X, *)

index 4

LPM lookupon the

source IPaddress

LPM lookupon the

destination IPaddress

secondary table of filters

of the form(*, Y)


Back to the ExampleBack to the Example

A (*, *)

B

C

E

R1

R 2

R3

R4

R5R 6 R 7

121.45.5.0

121.45.5.255

132.59.10.0 255.255.255.255

128.67.0.0

128.67.32.255

125.12.12.255

125.12.12.0

128.67.32.0

128.67.255.255

147.101.10.0

147.101.10.255

255.255.255.255

132.59.10.255

132.59.255.255

D

132.59.0.0

cross products R1, R3-R7 can

be removed from primary

table


Transport Level SharingTransport Level Sharing Rules that specify the same source-destination IP Rules that specify the same source-destination IP

prefix pair are consecutiveprefix pair are consecutive

These rules share sets of transport level fieldsThese rules share sets of transport level fields

Most specific filter matching returns a list of Most specific filter matching returns a list of pointers to share sets of transport level fields pointers to share sets of transport level fields

Packet classification in the transport level Packet classification in the transport level dimensions is done in hardwaredimensions is done in hardware


Hardware AccelerationHardware Accelerationapproach: we do not need to expand a range into prefixes

• We use a pair of comparators• We compare a key with upper and lower bounds

bin #1(e.g., 13 entries)



…

…

2Kfast memory

entries




…

…

entries

16 entry comparator

circuit custom hardware

Stage 2 Lookup

comparator

… …

key (16 bits) lower bound

… …

key upper bound

borrow

borrow

“match”

comparator

comparator

… …

key (16 bits) lower bound

… …

key upper bound

borrow

borrow

“match”

comparator

Range matching circuit for a field

key shifter & mask

16 bits 16 bitrange matching

key shifter & mask


key shifter & mask

Regular TCAMcheck

“match”

done in softwarekey shifter &

mask16 bits 16 bit

range matching

key shifter & mask


key shifter & mask

16 bitsRegular TCAM

check

“match”

Custom Hardware Design for an entry


PerformancePerformance Lookup timeLookup time

Small and predictable number of steps independent of Small and predictable number of steps independent of the number of rulesthe number of rules

11 memory accesses11 memory accesses

Memory SpaceMemory Space Reasonable: 19-446 KB (for ACLs 1-4 without HW Reasonable: 19-446 KB (for ACLs 1-4 without HW

acceleration)acceleration)

Memory Access BWMemory Access BW 64 words/access (without HW acceleration)64 words/access (without HW acceleration) 4 words/access (with HW acceleration)4 words/access (with HW acceleration)

Update TimeUpdate Time Approximately 197,000 memory accesses Approximately 197,000 memory accesses

®

Tutorial Summary and ConclusionTutorial Summary and Conclusion


Tutorial Summary (I)Tutorial Summary (I)

Packet ClassificationPacket Classification A complex open problemA complex open problem

State of the artState of the art Existing schemes trade-off the lookup time or Existing schemes trade-off the lookup time or

the memory requirementthe memory requirement


Tutorial Summary (II)Tutorial Summary (II) Observations on real world classifiersObservations on real world classifiers

Few partial overlaps between IP prefix pairsFew partial overlaps between IP prefix pairs

Shared sets of transport level fieldsShared sets of transport level fields

Proposed a new schemeProposed a new scheme Exploits classifier propertiesExploits classifier properties

Predictable lookup time with reasonable Predictable lookup time with reasonable memory requirementmemory requirement

Requires HW acceleration Requires HW acceleration


Future WorkFuture Work

Verify the scheme using more data setsVerify the scheme using more data sets

Simplify the update processSimplify the update process

Apply other fast solutions to the first stageApply other fast solutions to the first stage HyperCuttsHyperCutts

®

Packet SchedulingPacket Scheduling


Tutorial SummaryTutorial Summary PART I: PART I:

Understanding the problemUnderstanding the problem

PART II: PART II: State-of-the-artState-of-the-art

PART III: PART III: Sorting packets by packet schedulers using the Sorting packets by packet schedulers using the

Connected Trie data structureConnected Trie data structure

PART IV: PART IV: Building a four level, OC-48, programmable Building a four level, OC-48, programmable

hierarchical packet schedulerhierarchical packet scheduler

®

PART I: Understanding the ProblemPART I: Understanding the Problem


The Concept of QoSThe Concept of QoS Packet networks Packet networks

Usually provide best effort servicesUsually provide best effort services

Can we make packet networks capable of Can we make packet networks capable of delivering continuous media?delivering continuous media?

Key concept: make packet networks flow aware Key concept: make packet networks flow aware

MechanismsMechanisms SchedulingScheduling

ShapingShaping

Resource ReservationResource Reservation

Admission ControlAdmission Control


Packet SchedulingPacket Scheduling

trafficshaper

head-of-linepackets

scheduler

session 1

session N

…

delay jitter

burst

shaped traffic


Generalized Processor Sharing (GPS)Generalized Processor Sharing (GPS)

Ideal scheduling disciplineIdeal scheduling discipline Visits each nonempty queue and serves an Visits each nonempty queue and serves an

infinitesimally small amount of datainfinitesimally small amount of data

Supports exact max-min fair share allocationSupports exact max-min fair share allocation Resources are allocated in order of increasing demandResources are allocated in order of increasing demand

No source gets a resource share larger than its demandNo source gets a resource share larger than its demand

Sources with unsatisfied demands get an equal share of Sources with unsatisfied demands get an equal share of the resourcethe resource


Weighted Fair Queuing (WFQ)Weighted Fair Queuing (WFQ)

Key idea: If you can’t implement GPS Key idea: If you can’t implement GPS simulate it on the sidesimulate it on the side

The algorithm:The algorithm: Tag packets with numbers denoting the order Tag packets with numbers denoting the order

of completion of service according to the of completion of service according to the simulated GPS disciplinesimulated GPS discipline

Transmit the packets in the ascending order of Transmit the packets in the ascending order of their tagstheir tags


Time Stamp CalculationTime Stamp Calculation

tag of the next packet in the queue

= MAX(tag of the

previous packet in the queue

round number of the simulated

GPS service, ) +

packet size

connection weight

most difficultto determine


ExampleExample

6 8

2

4

0slope = 1/3

slope = 1

round number ofthe simulated GPS

discipline

time

4 2 2WFQ

3 connections of equal weights

three packets of sizes 2, 2 and 4units arrive concurrently

capacity = 1unit/sec

2

2

4

packets of size 2complete transmission

packet of size 4completes

transmission


Relative Fairness BoundRelative Fairness Bound

Relative Fairness

Bound= MAX |

service received by connection Aduring an interval

rate allocated to A

service received by connection B

during this interval

rate allocated to B

- |

A, B = backlogged connections


Absolute Fairness BoundAbsolute Fairness Bound

AbsoluteFairness

Bound= MAX |

service received by connection Aduring an interval

rate allocated to A

service received by connection A

during this intervalif serviced by GPS

rate allocated to A

- |

A = backlogged connection


Hierarchical Packet SchedulingHierarchical Packet Scheduling

Single level schedulingSingle level scheduling Transmission order does not depend on future arrivalsTransmission order does not depend on future arrivals

Hierarchical schedulingHierarchical scheduling Transmission order depends on future arrivalsTransmission order depends on future arrivals

To implement hierarchical GPS we need to built a hierarchy of single To implement hierarchical GPS we need to built a hierarchy of single level fair queuing disciplines level fair queuing disciplines


Schedulable RegionSchedulable Region Set of all possible combinations of Set of all possible combinations of

performance bounds a scheduler can performance bounds a scheduler can simultaneously meetsimultaneously meet

Class I

Class IIClass III

®

PART II: State-of-ArtPART II: State-of-Art


Tagging and Sorting SchemesTagging and Sorting Schemes

Demers, Keshav, ShenkerWFQ

1989 1991

Lazar, Hyman, PacificiMARS

1993 1995

Goyal, Vin, ChengStart Time Fair QueingBennet, Stephens, ZhangCMU Sorting SchemeRexford,Bonomi, Greenberg AT&T Sorting Scheme

1996 2003

Ramabhadran, PasqualeStratified Round Robin

Kounavis, Kumar, YavatkarConnected Trie Data Structure

2004

ValenteExact GPS Simulationwith logarithmiccomplexity

Parekh, GallagerGeneralized Processor SharingLazar, Hyman, PacificiSchedulable Region

GolestaniSelf Clocked Fair Queing

Shreedhar, Varghese Deficit Round Robin

1994


Self-Clocked Fair Queuing (SCFQ)Self-Clocked Fair Queuing (SCFQ)


GPS service

finish time of the packet currently

in service =

Same as WFQ apart from:

Main disadvantage: large end-to-end delay


Start Time Fair Queuing (SFQ)Start Time Fair Queuing (SFQ)


GPS service

start time of the packet currently

in service =

Transmission order: Ascending order of start timesSame end-to-end delay as WFQ



W2FQW2FQ

transmission order

select the packet with the minimum tag from among those that have alreadystarted service in the corresponding

GPS simulation

=

Smaller Absolute Fairness Bound



Round Robin SchedulingRound Robin Scheduling

Deficit Round RobinDeficit Round Robin Each connection has a deficit counterEach connection has a deficit counter

Every round the deficit counter is incremented Every round the deficit counter is incremented by a quantumby a quantum

If the packet size < counter then the packet is If the packet size < counter then the packet is transmitted and the counter is reduced by the transmitted and the counter is reduced by the packet sizepacket size


Exact GPS Simulation with Logarithmic Complexity Exact GPS Simulation with Logarithmic Complexity

L-GPS simulates GPS with minimum deviation of L-GPS simulates GPS with minimum deviation of one packet size at one packet size at OO(log(logNN) complexity) complexity All other well known schedulers that accomplish the All other well known schedulers that accomplish the

same deviation (e.g., Wsame deviation (e.g., W22FQ) have O(FQ) have O(NN) complexity) complexity

Key idea:Key idea: L-GPS pre-computes the evolution of the round number L-GPS pre-computes the evolution of the round number

function of the simulated GPS service using a tree function of the simulated GPS service using a tree structurestructure


Some Sorting Data StructuresSome Sorting Data Structures

HeapsHeaps

Binomial heapsBinomial heaps

Calendar QueuesCalendar Queues

Van Emde Boas TreesVan Emde Boas Trees

Trees of ComparatorsTrees of Comparators

CMU SortingCMU Sorting

AT&T SortingAT&T Sorting

Polytechnic Institute SortingPolytechnic Institute Sorting


The Tree of ComparatorsThe Tree of ComparatorsYou divide packets into groups

You send each group to a stage of comparators

You obtain a minimum from each comparator

You pass the minima into a second stage of comparators

You repeat the process until one packet remains

…

second stage of

comparators

…

final stage of

comparators

packet with minimum tag

group of head-of-line

packets

…

line group of

head-of-packets

…

packet in group with

minimum tag

…

…

first stage of

comparators

comparator


The Sorting Scheme from AT&TThe Sorting Scheme from AT&T

FIFO 1

FIFO 2

FIFO k

Bin 1

Bin 2

Bin m

Connection FIFOs Sorting Bins

scheduling horizon


The Sorting Scheme from Polytechnic Inst. Brooklyn The Sorting Scheme from Polytechnic Inst. Brooklyn

Sorting is supported by a hierarchy of bit vectors

Range of time stamp values

®

PART III: Sorting Packets by Packet Schedulers Using the Connected Trie Data Structure

PART III: Sorting Packets by Packet Schedulers Using the Connected Trie Data Structure


ContributionContribution We propose a sorting algorithm and data structure that We propose a sorting algorithm and data structure that

reduces the latency of making scheduling decisions to a reduces the latency of making scheduling decisions to a singe memory access timesinge memory access time Solution is applicable to SCFQ, SFQSolution is applicable to SCFQ, SFQ

Key ObservationKey Observation Increments on packet time stamps are between the range: Increments on packet time stamps are between the range:

(maximum packet size)/(minimum weight)(maximum packet size)/(minimum weight) This is called the scheduling horizonThis is called the scheduling horizon

ApproachApproach We represent the scheduling horizon as a trieWe represent the scheduling horizon as a trie We put state into the nodes of the trie to allow the leaves to We put state into the nodes of the trie to allow the leaves to

be connected into linked list be connected into linked list


Trie-based OrderingTrie-based Ordering

trie structure of heighth = log(scheduling horizon/region width)

0

scheduling horizon =

maximum packet size

minimumweight

10

0 1 1

region of time stampsfor packet A

trie traversal for packet A

region of time stampsfor packet B

trie traversalfor packet B


Van Emde Boas Trees and the Connecting TrieVan Emde Boas Trees and the Connecting Trie

linear traversal binary traversal(Van Emde Boas Tree)

optimal traversallinear traversal optimal traversal(Connected Trie)


Main ConceptsMain Concepts Each node stores: Each node stores:

a pointer to the rightmost leaf a pointer to the rightmost leaf

– with the highest value from among those found by traversing the left child of the node

a pointer to the leftmost leaf a pointer to the leftmost leaf

– with the lowest value from among those found by traversing the right child of the node

When a new packet is added into the trieWhen a new packet is added into the trie

– The algorithm ‘discovers’ the rightmost and leftmost leaves the new packet should be connected to

– The new packet is inserted into a linked list of leaves

Hence the next packet for transmission into the network is Hence the next packet for transmission into the network is found in a single memory access timefound in a single memory access time


Example: Root onlyExample: Root only

R (-∞, +∞)


Example: Adding 13Example: Adding 13

R (-∞, 13)

NULL

C

B

A

13

(13, +∞)

(-∞, 13)

(-∞, 13)

1

1

1

0

NULL



C

B

D

E

F

5

R

A

13

(5, 13)

(13, +∞)

(-∞, 13)

(-∞, 13)

(-∞, 5)

(5, 13)

(-∞, 5)

NULL NULL

0

0

1

1

1

1

1

0



C

B

10

H

D

E

F

5

R

A

G

13

(5, 10)

(5, 10) (13, +∞)

(-∞, 13)

(10, 13)

(10, 13)(-∞, 5)

(5, 13)

(-∞, 5)

NULL NULL

0

0

1

1

1

1

1

1

0

0

0



C

B

10

H

8

D

E

F

5

R

A

I

G

13

(5, 8)

(8, 10) (13, +∞)

(-∞, 13)

(10, 13)

(10, 13)(8, 10)(-∞, 5)

(5, 13)

(-∞, 5)

NULL NULL

0

0

1

1

1

1

1

1

0

0

0 0

0


The Node Traversal AlgorithmThe Node Traversal Algorithm

visit a node

update the information about the leftmost and rightmost leaves where

the new element should be connected to

if the new element is greater (lower) than the rightmost (leftmost) leaf of the node, update the state

on the node

visit the next node


Using Two Tries at a TimeUsing Two Tries at a Time Why two tries at a time:Why two tries at a time:

Let’s assume Let’s assume D D is the scheduling horizonis the scheduling horizon

During the transmission of a packet, new packets will be During the transmission of a packet, new packets will be associated with time stamp increments at most associated with time stamp increments at most DD..

During the transmission of these packets, new packets will be During the transmission of these packets, new packets will be associated with time stamp increments at most 2associated with time stamp increments at most 2DD..

Trie 1 Trie 2 Trie 3

once all packets in Trie 1are transmitted we reuse the memory

and start building Trie 3 a.s.o

Trie 4


Optimal Height of the Connected TrieOptimal Height of the Connected Trie

optimal height

= log(

least common multiple of weights

greatest commondivisor of packet

sizes

maximum packet size

minimum weight

X )


PerformancePerformance Scheduling decision timeScheduling decision time

Exactly 1 memory access independent of the number of Exactly 1 memory access independent of the number of flows in the schedulerflows in the scheduler

Insertion timeInsertion time 6 read accesses, 7 write accesses for a trie of height 12.6 read accesses, 7 write accesses for a trie of height 12.

Memory access bandwidthMemory access bandwidth 10 words per read access, 6 words per write access for a 10 words per read access, 6 words per write access for a

trie of height 12trie of height 12

Memory requirementMemory requirement 34KB for 256 connections, 213 KB for 64K connections, 34KB for 256 connections, 213 KB for 64K connections,

for a trie of height 12 for a trie of height 12

®

PART IV: Building a Four Level, OC-48, Programmable Hierarchical Packet Scheduler

PART IV: Building a Four Level, OC-48, Programmable Hierarchical Packet Scheduler


ContributionContribution Efficient implementation of hierarchical scheduling on Efficient implementation of hierarchical scheduling on

the IXP2xxx series and the next generation the IXP2xxx series and the next generation processorsprocessors Support for OC48 on IXP28xxSupport for OC48 on IXP28xx

Budget of 228 cycles/packet for scheduling only!Budget of 228 cycles/packet for scheduling only!

Support for multiple levels of hierarchy upto 5 levelsSupport for multiple levels of hierarchy upto 5 levels

Support for a total of <= 256K input queues with Support for a total of <= 256K input queues with arbitrary weights at each levelarbitrary weights at each level

Any possible configuration of the hierarchies should be Any possible configuration of the hierarchies should be supportedsupported


Examples of SchedulersExamples of Schedulers

Egress Scheduler Configuration

SCFQ

SCFQ

… SCFQ

Fabric

Level 1

Level 00

3

SCFQ

SCFQ

…

SCFQ

…

SCFQ

0

256

256

0

Level 2

0

256

…

pipes

0

256

…

pipes

262144 pipestotal

SCFQ

SCFQ

… SCFQ

Fabric

Level 1

Level 00

3

SCFQ

SCFQ

…

SCFQ

…

SCFQ

0

256

256

0

Level 2

0

256

…

pipes

0

256

…

pipes

total

0

15

… WRR

0

15

… WRR

VOQs

VOQsRR

Fabric Port 0

Fabric Port 255

Fabric 0

4095

… WRR

Fabric

VOQs

Ingress Scheduler Configurations

0

7

… SCFQ

0

7

… SCFQ

pipes

pipesSCFQ

Level 5

Level 1 0

262143

…SCFQ

Fabric

pipes

Egress Scheduler Configurations

…

32K pipestotal

0

15

… WRR

0

15

… WRR

VOQs

VOQsRR

Fabric Port 0

Fabric Port 255

Fabric 0

4095

… WRR

Fabric

VOQs

0

7

… SCFQ

0

7

… SCFQ

pipes

pipesSCFQ

Level 5

0

262143

…SCFQ

Fabric

pipes…

total


ApproachApproach Supporting many single level schedulers (e.g., Supporting many single level schedulers (e.g.,

64K) using a limited number of microengines and 64K) using a limited number of microengines and threadsthreads

We assign a small number of threads to serve each entire We assign a small number of threads to serve each entire level of the hierarchy as opposed to a single scheduler level of the hierarchy as opposed to a single scheduler only! only!

packets are exchanged between levels at line ratepackets are exchanged between levels at line rate

Bandwidth sharing takes place at different levels in Bandwidth sharing takes place at different levels in parallel!parallel!

Sorting of PacketsSorting of Packets We use the connected trie and tree of comparators We use the connected trie and tree of comparators

structures structures


Parallelizing the Hierarchical SchedulerParallelizing the Hierarchical Scheduler We assign a small number of threads to serve each We assign a small number of threads to serve each

level of the hierarchylevel of the hierarchy We can do that because packets are exchanged between We can do that because packets are exchanged between

levels at line ratelevels at line rate

We insert pre-sorted packets at each level We insert pre-sorted packets at each level We can do that because hierarchical schedulers consist of We can do that because hierarchical schedulers consist of

independent single level schedulers independent single level schedulers

We parallelize the dequeue processing at each of the We parallelize the dequeue processing at each of the levelslevels

A dequeue thread creates a hole (i.e., empty packet space)A dequeue thread creates a hole (i.e., empty packet space)

A hole filling thread at the next level fills the hole by A hole filling thread at the next level fills the hole by inserting a new packetinserting a new packet


High Level Design from First Order PrinciplesHigh Level Design from First Order Principles

Fact, Assumption or Design Principle

Consequence High-Level Design Guideline

a hierarchical scheduler consists of single-level schedulers

we can insert presorted packets at each level

we address the sorting problem locally at each single level scheduler

packets need to be exchanged between levels at line rate

we can assign a small number of threads to serve each entire level of the

hierarchy

the levels of the hierarchy can operate in parallel independent of

each other

each packet transmitted at a level creates an empty packet space, which

we call a ‘hole’

to fill a hole you may need to perform at least one SRAM access which can be as

large as 300 compute cycles

we need to buffer more than one packet at each level of the scheduler

the enqueuing process may insert packets into single-level schedulers

at any level

enqueuing and hole-filling threads may need to access the same state

information concurrently

mutual exclusion techniques are required

calculating a virtual time function is complex

virtual time can be approximated by the finish time of the packet currently in

service (SCFQ)

a packet entering a scheduler does not need to preempt the packet

currently in service


Parallelized Hierarchical Scheduler IllustratedParallelized Hierarchical Scheduler Illustrated

…

line rate

line rateline rate

line rate

level 1

threads

level 1

threadsthreads

level

threads

level 2

threadsthreads

leaf level

threads

leaf level

threadsthreads

root level

threadsthreadsthreads

…

SCFQ

SCFQ

SCFQ

SCFQ

…

SCFQ

SCFQ

SCFQ

SCFQ

…

SCFQ

SCFQ

SCFQ

SCFQ

SCFQ

single levelschedulers

are represented by state in

SRAM memory

communicationacross levels is donethrough local memory

or next neighbor registers


Meeting the OC-48 Line Rate with BufferingMeeting the OC-48 Line Rate with Buffering

Final Output

Why buffering?Why buffering? To cope for the fact that a single SRAM access may take more time 228 cyclesTo cope for the fact that a single SRAM access may take more time 228 cycles

Enqueue ProcessEnqueue Process We maintain We maintain two minimum tagtwo minimum tag packets at the output of each scheduler packets at the output of each scheduler

Dequeue ProcessDequeue Process While a hole is filled, the next packet While a hole is filled, the next packet is ready to be servicedis ready to be serviced

SCFQ

SCFQ

SCFQ

Level 0

Level 1Head of line packets

Input queues– usehardware queues

A hole filling threadstarts inserting a new packet

in the root scheduler

a packet is being transmitted

a second packetbuffered at the root is

transmitted


Hierarchical Scheduler PrototypedHierarchical Scheduler Prototyped

Egress Scheduler Configuration

SCFQ

SCFQ

… SCFQ

Fabric

Level 1

Level 00

3

SCFQ

SCFQ

…

SCFQ

…

SCFQ

0

256

256

0

Level 2

0

256

…

pipes

0

256

…

pipes

262144 pipestotal

SCFQ

SCFQ

… SCFQ

Fabric

Level 1

Level 00

3

SCFQ

SCFQ

…

SCFQ

…

SCFQ

0

256

256

0

Level 2

0

256

…

pipes

0

256

…

pipes

total

0

15

… WRR

0

15

… WRR

VOQs

VOQsRR

Fabric Port 0

Fabric Port 255

Fabric 0

4095

… WRR

Fabric

VOQs

Ingress Scheduler Configurations

0

7

… SCFQ

0

7

… SCFQ

pipes

pipesSCFQ

Level 5

Level 1 0

262143

…SCFQ

Fabric

pipes

Egress Scheduler Configurations

…

32K pipestotal

0

15

… WRR

0

15

… WRR

VOQs

VOQsRR

Fabric Port 0

Fabric Port 255

Fabric 0

4095

… WRR

Fabric

VOQs

0

7

… SCFQ

0

7

… SCFQ

pipes

pipesSCFQ

Level 5

0

262143

…SCFQ

Fabric

pipes…

total

0

7

… SCFQ

0

7

… SCFQ

pipes

pipesSCFQ

Level 5

Level 1

…

32K connectionstotal


Use of the IXP MicroenginesUse of the IXP Microengines

HF2BHF3BQMIB

…………

level 1level 2level 3level 4 level 0(root)

HF0HF1AHF2AHF3AQMIA

DT

QueueManager

HF1B

hole-fillingthreads

EN0A

EN0B

threadsserving levels

0 and 1

EN1A

EN1B

threadsserving levels

2 and 3

EN2A

EN2B

threadsserving level

4

enqueuingthreads

ME0ME1

ME2


RemarksRemarks Four level OC-48 line rate forwarding (2.5 Gbps)Four level OC-48 line rate forwarding (2.5 Gbps)

Data structures fit into the local memory and Data structures fit into the local memory and SRAM of IXP2400SRAM of IXP2400

We keep the memory access bandwidth We keep the memory access bandwidth consumption at reasonable level consumption at reasonable level We fetch either the heads or tails of queues but not both We fetch either the heads or tails of queues but not both

at the same timeat the same time

We employ novel inter-thread synchronization We employ novel inter-thread synchronization algorithmsalgorithms

®

Tutorial Summary and ConclusionTutorial Summary and Conclusion


Tutorial Summary (I)Tutorial Summary (I) Packet schedulingPacket scheduling

Critical component of router datapathsCritical component of router datapaths

Generalized Processor SharingGeneralized Processor Sharing Ideal service (non-implementable)Ideal service (non-implementable)

Real implementationsReal implementations Annotate packets with time stampsAnnotate packets with time stamps

Sort packets according to their time stamp Sort packets according to their time stamp valuesvalues


Tutorial Summary (II)Tutorial Summary (II) We propose a new algorithm for sorting We propose a new algorithm for sorting

packetspackets Reduces the latency of making scheduling Reduces the latency of making scheduling

decisions to a single memory access timedecisions to a single memory access time

Parallelized Processor Architectures like Parallelized Processor Architectures like IXP2xxx are suitable for implementing IXP2xxx are suitable for implementing packet scheduling in softwarepacket scheduling in software


Future WorkFuture Work

Use the Connected Trie for implementing Use the Connected Trie for implementing disciplines other than SFQ, SCFQdisciplines other than SFQ, SCFQ Example: L-GPS?Example: L-GPS?

Use the Connected Trie in multi-level Use the Connected Trie in multi-level hierarchical scheduler configurations hierarchical scheduler configurations


Thanks for ListeningThanks for Listening


ReferencesReferences Michael E. Kounavis, Alok Kumar, Raj Yavatkar and Harrick Vin, Michael E. Kounavis, Alok Kumar, Raj Yavatkar and Harrick Vin,

““Two Stage Packet Classification Using Most Specific Filter Matching Two Stage Packet Classification Using Most Specific Filter Matching and Transport Level Sharingand Transport Level Sharing”, ”,

Technical ReportTechnical Report, Communications Technology Lab, Intel Corporation, , Communications Technology Lab, Intel Corporation, In Submission to Computer NetworksIn Submission to Computer Networks

Michael E. Kounavis, Alok Kumar, and Raj Yavatkar, Michael E. Kounavis, Alok Kumar, and Raj Yavatkar,

““Sorting Packets by Packet Schedulers Using the Connected Trie Data Sorting Packets by Packet Schedulers Using the Connected Trie Data StructureStructure”, ”,

Technical ReportTechnical Report, Communications Technology Lab, Intel Corporation, , Communications Technology Lab, Intel Corporation, In Submission to Software Practice and ExperienceIn Submission to Software Practice and Experience

Michael E. Kounavis, Alok Kumar, and Raj Yavatkar, Michael E. Kounavis, Alok Kumar, and Raj Yavatkar,

““A Four Level OC-48 Programmable Hierarchical Packet SchedulerA Four Level OC-48 Programmable Hierarchical Packet Scheduler”, ”,

Technical ReportTechnical Report, Communications Technology Lab, Intel Corporation, Communications Technology Lab, Intel Corporation

Documents

® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October