108
® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Michael Kounavis (Intel) Alok Kumar (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Harrick Vin (U. Texas, Austin) Austin) October 26, 2005 October 26, 2005

® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

Embed Size (px)

Citation preview

Page 1: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

®

Line Rate Packet Classification and Scheduling

Line Rate Packet Classification and Scheduling

Michael Kounavis (Intel)Michael Kounavis (Intel)

Alok Kumar (Intel)Alok Kumar (Intel)

Raj Yavatkar (Intel)Raj Yavatkar (Intel)

Harrick Vin (U. Texas, Austin)Harrick Vin (U. Texas, Austin)

October 26, 2005October 26, 2005

Page 2: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

®

Packet ClassificationPacket Classification

Page 3: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 3 • Communications TechnologyCommunications TechnologyLabLab

Tutorial SummaryTutorial Summary PART I: PART I:

Understanding the problemUnderstanding the problem

PART II: PART II: State-of-the-artState-of-the-art

PART III: PART III: Observations on real world classifiersObservations on real world classifiers

PART IV: PART IV: Two stage packet classification using Most Two stage packet classification using Most

Specific Filter Matching and Transport Level Specific Filter Matching and Transport Level Sharing Sharing

Page 4: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

®

PART I: Understanding the ProblemPART I: Understanding the Problem

Page 5: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 5 • Communications TechnologyCommunications TechnologyLabLab

Problem StatementProblem Statement Packet classifiersPacket classifiers

Lists of rulesLists of rules

Rules: <priority, predicate, action> tripletsRules: <priority, predicate, action> triplets

Single Match ProblemSingle Match Problem Find the highest priority rule that matches with Find the highest priority rule that matches with

a packeta packet

Multiple Match ProblemMultiple Match Problem Find all rules that match with a packetFind all rules that match with a packet

Page 6: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 6 • Communications TechnologyCommunications TechnologyLabLab

A Rule DatabaseA Rule Database

PERMIT* 147.101.* 1040-1070 *

DENY* 128.151.* * 2110-2150

DENY132.* 153.* * ftp

Src. IPSrc. IP Dest. IPDest. IPSrc. Src. PortPort

Dst. Dst. PortPort ActionAction

PredicatePredicate

Priority LevelPriority Level

1

2

3

a rule

Page 7: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 7 • Communications TechnologyCommunications TechnologyLabLab

TCP

A RuleA Rule

PERMIT* 147.101.* 2140-2170 *

Src. IP Dest. IPSrc. Port

Dst. Port Action

Predicate

Protocol

port ranges: arbitraryexamples:eq. 1050gt. 1024range 3000-4000

IP address ranges: IP prefixesexamples:128.59.67.0/255.255.255.0128.59.208.0/255.255.240.0128.59.0.0/255.255.0.0

Exact Value

Page 8: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 8 • Communications TechnologyCommunications TechnologyLabLab

An IP PrefixAn IP Prefix

A A range of values in a single dimension

128.67.0.0 128.67.255.2550.0.0.0

Src. IP address

range of values associated with the

Prefix 128.67.*

Page 9: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 9 • Communications TechnologyCommunications TechnologyLabLab

An Arbitrary RangeAn Arbitrary Range

A A range of values in a single dimension

2140 31400

Dst. Port

range of values between 2140 and 3140

Page 10: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 10 • Communications TechnologyCommunications TechnologyLabLab

An Exact ValueAn Exact Value

A A Specific Number

0

Protocol Field

1 6 17

TCP UDPICMP

Page 11: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 11 • Communications TechnologyCommunications TechnologyLabLab

A Source-Destination IP Prefix Pair A Source-Destination IP Prefix Pair

Src. IP

Dst. IP132.59.0.0

132.59.255.255

128.67.0.0

128.67.255.255

Dst. IP

Src. IP

132.59.64.10

128.67.0.0

128.67.255.255

Src. IP

Dst. IP132.59.64.10

128.67.208.1

pointfor (128.67.208.1,

132.59.64.10)

line segmentfor (128.67.*, 132.59.64.10)

rectanglefor (128.67.*,

132.59.*)

Page 12: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 12 • Communications TechnologyCommunications TechnologyLabLab

Relationship Between IP Prefix Pairs Relationship Between IP Prefix Pairs

Src. IP

128.67.0.0

128.67.255.255

128.67.32.0

128.67.32.255

Dst. IP

partially overlappingexample:

(128.67.*, 145.39.3.*) and (128.67.32.*, 145.39.*)

disjointexample:

(128.67.*, 121.45.5.*) and (128.67.32.*, 128.44.32.*)

fully overlappingexample:

(128.67.*, 167.7.*) and (128.67.32.*, 167.7.4.*)

121.45.5.0

121.45.5.255

128.44.32.0

128.44.32.255

145.39.0.0

145.39.3.0

145.39.3.255

145.39.255.255

167. 7.0.0

167. 7. 4.0

167.7.4.255

167. 7. 255.255

Page 13: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 13 • Communications TechnologyCommunications TechnologyLabLab

Partial Overlaps IP Prefix Pairs vs. Arbitrary RangesPartial Overlaps IP Prefix Pairs vs. Arbitrary Ranges

partially overlapping IP prefix pairs partially overlapping IP prefix pairs always form the shape of a crossalways form the shape of a cross

partially overlapping pairs ofpartially overlapping pairs ofarbitrary ranges may form any shapearbitrary ranges may form any shape

Page 14: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 14 • Communications TechnologyCommunications TechnologyLabLab

Packet Classification as a Point Location ProblemPacket Classification as a Point Location Problem

Given a point in a d-dimensional spacefind the highest priority hyper-cube that

contains the point

Rule 1

Rule 2

Rule 3

Rule 4

Rule 5

Page 15: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

®

PART II: State-of-the-ArtPART II: State-of-the-Art

Page 16: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 16 • Communications TechnologyCommunications TechnologyLabLab

Packet Classification: An Open ProblemPacket Classification: An Open Problem

Mogul et. al.Packet Filter Concept

1987 1994

Chazele et. al.Point Location AmongHyperplanes

1998

Lakshman and StiliadisBit VectorSrinivasan, Suri, VargheseGrid of Tries, Cross Producting

1999

Gupta, McKeownRecursive Flow ClassificationSrinivasan, Suri, VargheseTuple-Space Search

Baboescu et. al.,Aggregate Bit VectorGupta, McKeownHiCutts

2000 2003

Baboescu et. al.,Extended Grid of TriesSingh et. alHyperCuttsKounavis et. al.Most Specific Filter Matching

2004

Taylor, TurnerDistributed Cross Producting

Page 17: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 17 • Communications TechnologyCommunications TechnologyLabLab

Multi-dimensional TriesMulti-dimensional Tries

10

0

01

1

Rule 1

Rule 2 Rule 4

Rule 3

Destination Trie

Source Tries

backtracking

Rule 1: (*, 1*)

Rule 2: (0*, 0*)

Rule 3: (1*, 00*)

Rule 4: (1*, 0*)

Rule Database

Packet: (1101, 0011)

Page 18: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 18 • Communications TechnologyCommunications TechnologyLabLab

Grid of TriesGrid of Tries

10

0

01

1

Rule 1

Rule 2 Rule 4

Rule 3

0

backtracking is avoidedusing switch pointers

Rule 1: (*, 1*)

Rule 2: (0*, 0*)

Rule 3: (1*, 00*)

Rule 4: (1*, 0*)

Rule Database

Packet: (1101, 0011)

Page 19: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 19 • Communications TechnologyCommunications TechnologyLabLab

Bit Vector SchemesBit Vector Schemes

Dimension 1 Dimension 2 Dimension k

LPM Search LPM Search LPM Search

Bit Vector 1 Bit Vector 2 Bit Vector k

Logical ‘AND’ and Priority Resolution

Page 20: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 20 • Communications TechnologyCommunications TechnologyLabLab

Cross ProductingCross Producting

Dimension 1 Dimension 2 Dimension k

LPM Search LPM Search LPM Search

Index 1 Index 2 Index k

Index concatenation and hash lookup

Page 21: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 21 • Communications TechnologyCommunications TechnologyLabLab

Tuple-space SearchTuple-space Search

Key idea: Number of different combinations of prefix lengths (called tuples) is small (called tuples) is small

Rule 1: (*, 1*)

Rule 2: (0*, 0*)

Rule 3: (1*, 0*)

Rule 4: (1*, 00*)

Rule Database

Tuple 1: (*, 1*) Rule 1

Tuple 2: (1*, 1*) Rules 2 and 3

Tuple 3: (1*, 11*) Rule 4

Tuple Space

Hash Table 1

hash

lookuppacketAND

tuple 1

Hash Table 2

hash

lookuppacketAND

tuple 2

Hash Table 3

hash

lookuppacketAND

tuple 3

Page 22: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 22 • Communications TechnologyCommunications TechnologyLabLab

Recursive Flow ClassificationRecursive Flow Classification

packet

index 1

index 2

index 3

index 4

index 5

index 6

action

first recursion second recursion third recursion

Page 23: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 23 • Communications TechnologyCommunications TechnologyLabLab

HiCutts and HyperCuttsHiCutts and HyperCuttspacket

HiCutts: Cuts 1 dimension at a time

Rule 1Rule 2

…Rule m

HyperCutts: Cuts multiple dimensions at a time

packet

Page 24: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 24 • Communications TechnologyCommunications TechnologyLabLab

TCAMTCAM

0 1 0 1 1 0

Memory Array

Priority Encoder

Memory Location

Action Memory

TCAM

RAM

Matches

Entries

field values in the packet header

Page 25: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 25 • Communications TechnologyCommunications TechnologyLabLab

ComparisonComparisonAlgorithm Worst Case Lookup

Time ComplexityWorst Case Storage

Complexity

Multidimensional Tries wd ndw

Grid of Tries wd-1 ndw

Bit Vector dw + dn/a dn2/a

Tuple Space Search n n

Cross Producting dw nd

Recursive Flow Classification

d nd

HiCutts d nd

TCAM 1 n

n: number of rules, d: number of fields, w: field size

Page 26: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

®

PART III: Observations on Real World ClassifiersPART III: Observations on Real World Classifiers

Page 27: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 27 • Communications TechnologyCommunications TechnologyLabLab

What we ObservedWhat we Observed IP IP prefix pairs:prefix pairs:

create partial overlaps which are significantly create partial overlaps which are significantly fewer than the theoretical worst casefewer than the theoretical worst case

transport level fields:transport level fields: form sets which are being shared by many form sets which are being shared by many

different source-destination IP prefix pairsdifferent source-destination IP prefix pairs

sets usually contain a small number of entriessets usually contain a small number of entries

Page 28: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 28 • Communications TechnologyCommunications TechnologyLabLab

Toward Two Stage Packet ClassificationToward Two Stage Packet Classification

IP Prefix Pair m

IP Prefix Pair 3

IP Prefix Pair 2

IP Prefix Pair 1

Src. IP

Dst. IP

packetfirst stage finds themost specific filter

for the packet

Set #1

Set #2

Set #k

Second stage obtainsa list of pointers to

shared sets of transport level fields

HW accelerator

intersection

Page 29: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 29 • Communications TechnologyCommunications TechnologyLabLab

Observations on IP prefix PairsObservations on IP prefix Pairs

IP prefix pairs are of 2 typesIP prefix pairs are of 2 types partially-specified filters (i.e., (*,X) or (Y, *))partially-specified filters (i.e., (*,X) or (Y, *))

fully-specified filtersfully-specified filters

partially specified filters partially specified filters are a small fraction (< 25%) of IP prefix pairsare a small fraction (< 25%) of IP prefix pairs

most fully-specified filters (> 80%) are most fully-specified filters (> 80%) are represented by represented by segments of straight linessegments of straight lines

pointspoints

Page 30: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 30 • Communications TechnologyCommunications TechnologyLabLab

IP Prefix Pair OverlapsIP Prefix Pair Overlaps

n2/4 amount of overlaps

filter structure with O(n2) overlaps realistic filter structure

(*, *)

cluster #1

(*, X)

(Y, *)

cluster #2

cluster #m

realistic amount of overlaps ?

Page 31: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 31 • Communications TechnologyCommunications TechnologyLabLab

Visualizing ACLs with our FilterViewer ToolVisualizing ACLs with our FilterViewer Tool

Page 32: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 32 • Communications TechnologyCommunications TechnologyLabLab

Partial Filter Overlaps in the Realistic Filter StructurePartial Filter Overlaps in the Realistic Filter Structure

Breakdown of Overlaps

observed number of overlaps

theoretical worst case

overlaps formed by partially specified filters only

overlaps formed by

fully specified

filters only

overlaps formed

between partially and fully specified

filters

ACL1 4 90,525 100% 0% 0%

ACL2 2249 138,601 45% 4% 51%

ACL3 6138 1,260,078 88% 1% 11%

ACL4 852 12,246 100% 0% 0%

Page 33: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 33 • Communications TechnologyCommunications TechnologyLabLab

Why Few Filter OverlapsWhy Few Filter Overlaps Partially specified filters represent a small Partially specified filters represent a small

fraction of the total number of filters in fraction of the total number of filters in databasesdatabases

Fully specified filters create an insignificant Fully specified filters create an insignificant amount of overlapsamount of overlaps

There is a bounded number of important There is a bounded number of important servers per IP address domainservers per IP address domain

Page 34: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 34 • Communications TechnologyCommunications TechnologyLabLab

Observations on Transport Level FieldsObservations on Transport Level Fields

relative priority of fields is the same in different occurrences of the field set

Page 35: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 35 • Communications TechnologyCommunications TechnologyLabLab

Transport Level SharingTransport Level Sharing

number of rulesnumber of unique sets of transport

level fields

number of entries in unique sets of transport level

fields

ACL1 754 102 316

ACL2 607 35 68

ACL3 2399 186 437

ACL4 157 8 47

Page 36: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 36 • Communications TechnologyCommunications TechnologyLabLab

ImplicationsImplications Classification can be split into 2 stages:Classification can be split into 2 stages:

Stage 1: IP address fieldsStage 1: IP address fields

Stage 2: transport level fieldsStage 2: transport level fields

A design that returns the smallest filter A design that returns the smallest filter intersection is viable intersection is viable since the amount of overlaps between IP prefix pairs is since the amount of overlaps between IP prefix pairs is

smallsmall

Searching through transport level fields can be Searching through transport level fields can be accelerated by hardwareaccelerated by hardware

Page 37: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

®

PART IV: Two Stage Packet Classification Using Most Specific Filter Matching and Transport Level Sharing

PART IV: Two Stage Packet Classification Using Most Specific Filter Matching and Transport Level Sharing

Page 38: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 38 • Communications TechnologyCommunications TechnologyLabLab

Cross Producting RevisitedCross Producting Revisited

source IP address

destination IP address

132.59.255.255

132.59.10.0

132.59.10.255

128.67.0.0

128.67.32.0

128.67.255.255

128.67.32.255

121.45.5.0

121.45.5.255

125.12.12.255

125.12.12.0

cross product =(128.67.32.*, 132.59.10.*)

132.59.0.0

(128.67.32.*, 121.45.5.*) packet =(128.67.32.5, 132.59.10.10)

most specific filter =(128.67.*, 132.59.*)

(125.12.12.*, 132.59.10.*)

Cross Producting may return a non existent filter. To address this issue Cross

Producting adds all possible filters returned from LPM searches into its lookup table

Page 39: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 39 • Communications TechnologyCommunications TechnologyLabLab

Definition of a Cross ProductDefinition of a Cross Product

A filter with: A filter with: a source IP prefix equal to the prefix of a filter a source IP prefix equal to the prefix of a filter

FF11 from a database; and from a database; and

a destination IP prefix equal to the prefix of a destination IP prefix equal to the prefix of another filter another filter FF22 from the same database from the same database

FF22 is not necessarily equal to is not necessarily equal to FF11

Page 40: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 40 • Communications TechnologyCommunications TechnologyLabLab

Improving Cross ProductingImproving Cross Producting Cross Producting is fast but…Cross Producting is fast but…

Memory explosionMemory explosion

For ACL3 there are 431 src. IP prefixes and 517 dst. For ACL3 there are 431 src. IP prefixes and 517 dst. Prefixes, hence 222,396 cross productsPrefixes, hence 222,396 cross products

SolutionSolution We can remove 70%-80% of the cross products with We can remove 70%-80% of the cross products with

little penalty to the performance of the classifierlittle penalty to the performance of the classifier Not covered cross productsNot covered cross products

Partially covered productsPartially covered products

Page 41: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 41 • Communications TechnologyCommunications TechnologyLabLab

Motivating ExampleMotivating Example

A (*, *)

B

C

E

R1

R 2

R3

R4

R5R 6 R 7

121.45.5.0

121.45.5.255

132.59.10.0 255.255.255.255

128.67.0.0

128.67.32.255

125.12.12.255

125.12.12.0

128.67.32.0

128.67.255.255

147.101.10.0

147.101.10.255

255.255.255.255

132.59.10.255

132.59.255.255

D

132.59.0.0

do we need to store all

cross products R1-R7?

Page 42: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 42 • Communications TechnologyCommunications TechnologyLabLab

Not Covered Cross ProductsNot Covered Cross Products

source IP address

destination IP address

132.59.10.0132.59.10.255

128.67.32.0

128.67.32.255

121.45.5.0121.45.5.255

125.12.12.255

125.12.12.0

(125.12.12.*, 132.59.10.*)

(128.67.32.*, 121.45.5.*)

not covered cross product =(125.12.12.*, 121.45.5.*)

‘not covered’ cross products are only covered by (*,*). Hence they can be removed from the lookup table.

If no match is found the algorithm returns (*, *)

Page 43: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 43 • Communications TechnologyCommunications TechnologyLabLab

Partially Covered Cross ProductsPartially Covered Cross Products

‘partially covered’ cross products are only covered by (X,*) or (*, Y). Hence they can be removed from the

lookup table. If no match is found, the algorithm checks a database of partially specified filters. If no

match is found again the algorithm returns (*,*)

128.67.32.255

125.12.12.255

source IP address

destination IP address

132.59.10.0132.59.10.255

128.67.32.0

121.45.5.0121.45.5.255

125.12.12.0

(128.67.32.*, 121.45.5.*)

(125.12.12.*, 132.59.10.*)

partially covered cross product =(125.12.12.*, 121.45.5.*)

255.255.255.255

(125.12.12.*,*)

Page 44: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 44 • Communications TechnologyCommunications TechnologyLabLab

Fully Covered Cross ProductsFully Covered Cross Products

those cross products which are neither ‘not those cross products which are neither ‘not covered’ nor ‘partially covered’covered’ nor ‘partially covered’ fully specified filtersfully specified filters

filter intersections that are fully specifiedfilter intersections that are fully specified

filters which are: filters which are: formed by combining the source and destination formed by combining the source and destination

IP prefixes of different IP prefix pairs; and IP prefixes of different IP prefix pairs; and

contained into fully-specified filters or fully-contained into fully-specified filters or fully-specified filter intersectionsspecified filter intersections

these are called ‘indicator filters’ these are called ‘indicator filters’

Page 45: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 45 • Communications TechnologyCommunications TechnologyLabLab

Most Specific Filter MatchingMost Specific Filter Matching

index 1 index 2

index 1 index 2

lookup on atable of

fully coveredcross products

index 1 index 2

index 1 index 2

primary table

index 3

secondary table of filters

of the form(X, *)

index 4

LPM lookupon the

source IPaddress

LPM lookupon the

destination IPaddress

secondary table of filters

of the form(*, Y)

Page 46: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 46 • Communications TechnologyCommunications TechnologyLabLab

Back to the ExampleBack to the Example

A (*, *)

B

C

E

R1

R 2

R3

R4

R5R 6 R 7

121.45.5.0

121.45.5.255

132.59.10.0 255.255.255.255

128.67.0.0

128.67.32.255

125.12.12.255

125.12.12.0

128.67.32.0

128.67.255.255

147.101.10.0

147.101.10.255

255.255.255.255

132.59.10.255

132.59.255.255

D

132.59.0.0

cross products R1, R3-R7 can

be removed from primary

table

Page 47: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 47 • Communications TechnologyCommunications TechnologyLabLab

Transport Level SharingTransport Level Sharing Rules that specify the same source-destination IP Rules that specify the same source-destination IP

prefix pair are consecutiveprefix pair are consecutive

These rules share sets of transport level fieldsThese rules share sets of transport level fields

Most specific filter matching returns a list of Most specific filter matching returns a list of pointers to share sets of transport level fields pointers to share sets of transport level fields

Packet classification in the transport level Packet classification in the transport level dimensions is done in hardwaredimensions is done in hardware

Page 48: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 48 • Communications TechnologyCommunications TechnologyLabLab

Hardware AccelerationHardware Accelerationapproach: we do not need to expand a range into prefixes

• We use a pair of comparators• We compare a key with upper and lower bounds

bin #1(e.g., 13 entries)

bin #2(e.g., 4 entries)

bin #3(e.g., 7 entries)

2Kfast memory

entries

bin #1(e.g., 13 entries)

bin #2(e.g., 4 entries)

bin #3(e.g., 7 entries)

entries

16 entry comparator

circuit custom hardware

Stage 2 Lookup

comparator

… …

key (16 bits) lower bound

… …

key upper bound

borrow

borrow

“match”

comparator

comparator

… …

key (16 bits) lower bound

… …

key upper bound

borrow

borrow

“match”

comparator

Range matching circuit for a field

key shifter & mask

16 bits 16 bitrange matching

key shifter & mask

16 bits 16 bitrange matching

key shifter & mask

Regular TCAMcheck

“match”

done in softwarekey shifter &

mask16 bits 16 bit

range matching

key shifter & mask

16 bits 16 bitrange matching

key shifter & mask

16 bitsRegular TCAM

check

“match”

Custom Hardware Design for an entry

Page 49: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 49 • Communications TechnologyCommunications TechnologyLabLab

PerformancePerformance Lookup timeLookup time

Small and predictable number of steps independent of Small and predictable number of steps independent of the number of rulesthe number of rules

11 memory accesses11 memory accesses

Memory SpaceMemory Space Reasonable: 19-446 KB (for ACLs 1-4 without HW Reasonable: 19-446 KB (for ACLs 1-4 without HW

acceleration)acceleration)

Memory Access BWMemory Access BW 64 words/access (without HW acceleration)64 words/access (without HW acceleration) 4 words/access (with HW acceleration)4 words/access (with HW acceleration)

Update TimeUpdate Time Approximately 197,000 memory accesses Approximately 197,000 memory accesses

Page 50: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

®

Tutorial Summary and ConclusionTutorial Summary and Conclusion

Page 51: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 51 • Communications TechnologyCommunications TechnologyLabLab

Tutorial Summary (I)Tutorial Summary (I)

Packet ClassificationPacket Classification A complex open problemA complex open problem

State of the artState of the art Existing schemes trade-off the lookup time or Existing schemes trade-off the lookup time or

the memory requirementthe memory requirement

Page 52: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 52 • Communications TechnologyCommunications TechnologyLabLab

Tutorial Summary (II)Tutorial Summary (II) Observations on real world classifiersObservations on real world classifiers

Few partial overlaps between IP prefix pairsFew partial overlaps between IP prefix pairs

Shared sets of transport level fieldsShared sets of transport level fields

Proposed a new schemeProposed a new scheme Exploits classifier propertiesExploits classifier properties

Predictable lookup time with reasonable Predictable lookup time with reasonable memory requirementmemory requirement

Requires HW acceleration Requires HW acceleration

Page 53: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 53 • Communications TechnologyCommunications TechnologyLabLab

Future WorkFuture Work

Verify the scheme using more data setsVerify the scheme using more data sets

Simplify the update processSimplify the update process

Apply other fast solutions to the first stageApply other fast solutions to the first stage HyperCuttsHyperCutts

Page 54: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

®

Packet SchedulingPacket Scheduling

Page 55: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 55 • Communications TechnologyCommunications TechnologyLabLab

Tutorial SummaryTutorial Summary PART I: PART I:

Understanding the problemUnderstanding the problem

PART II: PART II: State-of-the-artState-of-the-art

PART III: PART III: Sorting packets by packet schedulers using the Sorting packets by packet schedulers using the

Connected Trie data structureConnected Trie data structure

PART IV: PART IV: Building a four level, OC-48, programmable Building a four level, OC-48, programmable

hierarchical packet schedulerhierarchical packet scheduler

Page 56: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

®

PART I: Understanding the ProblemPART I: Understanding the Problem

Page 57: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 57 • Communications TechnologyCommunications TechnologyLabLab

The Concept of QoSThe Concept of QoS Packet networks Packet networks

Usually provide best effort servicesUsually provide best effort services

Can we make packet networks capable of Can we make packet networks capable of delivering continuous media?delivering continuous media?

Key concept: make packet networks flow aware Key concept: make packet networks flow aware

MechanismsMechanisms SchedulingScheduling

ShapingShaping

Resource ReservationResource Reservation

Admission ControlAdmission Control

Page 58: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 58 • Communications TechnologyCommunications TechnologyLabLab

Packet SchedulingPacket Scheduling

trafficshaper

head-of-linepackets

scheduler

session 1

session N

delay jitter

burst

shaped traffic

Page 59: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 59 • Communications TechnologyCommunications TechnologyLabLab

Generalized Processor Sharing (GPS)Generalized Processor Sharing (GPS)

Ideal scheduling disciplineIdeal scheduling discipline Visits each nonempty queue and serves an Visits each nonempty queue and serves an

infinitesimally small amount of datainfinitesimally small amount of data

Supports exact max-min fair share allocationSupports exact max-min fair share allocation Resources are allocated in order of increasing demandResources are allocated in order of increasing demand

No source gets a resource share larger than its demandNo source gets a resource share larger than its demand

Sources with unsatisfied demands get an equal share of Sources with unsatisfied demands get an equal share of the resourcethe resource

Page 60: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 60 • Communications TechnologyCommunications TechnologyLabLab

Weighted Fair Queuing (WFQ)Weighted Fair Queuing (WFQ)

Key idea: If you can’t implement GPS Key idea: If you can’t implement GPS simulate it on the sidesimulate it on the side

The algorithm:The algorithm: Tag packets with numbers denoting the order Tag packets with numbers denoting the order

of completion of service according to the of completion of service according to the simulated GPS disciplinesimulated GPS discipline

Transmit the packets in the ascending order of Transmit the packets in the ascending order of their tagstheir tags

Page 61: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 61 • Communications TechnologyCommunications TechnologyLabLab

Time Stamp CalculationTime Stamp Calculation

tag of the next packet in the queue

= MAX(tag of the

previous packet in the queue

round number of the simulated

GPS service, ) +

packet size

connection weight

most difficultto determine

Page 62: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 62 • Communications TechnologyCommunications TechnologyLabLab

ExampleExample

6 8

2

4

0slope = 1/3

slope = 1

round number ofthe simulated GPS

discipline

time

4 2 2WFQ

3 connections of equal weights

three packets of sizes 2, 2 and 4units arrive concurrently

capacity = 1unit/sec

2

2

4

packets of size 2complete transmission

packet of size 4completes

transmission

Page 63: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 63 • Communications TechnologyCommunications TechnologyLabLab

Relative Fairness BoundRelative Fairness Bound

Relative Fairness

Bound= MAX |

service received by connection Aduring an interval

rate allocated to A

service received by connection B

during this interval

rate allocated to B

- |

A, B = backlogged connections

Page 64: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 64 • Communications TechnologyCommunications TechnologyLabLab

Absolute Fairness BoundAbsolute Fairness Bound

AbsoluteFairness

Bound= MAX |

service received by connection Aduring an interval

rate allocated to A

service received by connection A

during this intervalif serviced by GPS

rate allocated to A

- |

A = backlogged connection

Page 65: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 65 • Communications TechnologyCommunications TechnologyLabLab

Hierarchical Packet SchedulingHierarchical Packet Scheduling

Single level schedulingSingle level scheduling Transmission order does not depend on future arrivalsTransmission order does not depend on future arrivals

Hierarchical schedulingHierarchical scheduling Transmission order depends on future arrivalsTransmission order depends on future arrivals

To implement hierarchical GPS we need to built a hierarchy of single To implement hierarchical GPS we need to built a hierarchy of single level fair queuing disciplines level fair queuing disciplines

Page 66: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 66 • Communications TechnologyCommunications TechnologyLabLab

Schedulable RegionSchedulable Region Set of all possible combinations of Set of all possible combinations of

performance bounds a scheduler can performance bounds a scheduler can simultaneously meetsimultaneously meet

Class I

Class IIClass III

Page 67: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

®

PART II: State-of-ArtPART II: State-of-Art

Page 68: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 68 • Communications TechnologyCommunications TechnologyLabLab

Tagging and Sorting SchemesTagging and Sorting Schemes

Demers, Keshav, ShenkerWFQ

1989 1991

Lazar, Hyman, PacificiMARS

1993 1995

Goyal, Vin, ChengStart Time Fair QueingBennet, Stephens, ZhangCMU Sorting SchemeRexford,Bonomi, Greenberg AT&T Sorting Scheme

1996 2003

Ramabhadran, PasqualeStratified Round Robin

Kounavis, Kumar, YavatkarConnected Trie Data Structure

2004

ValenteExact GPS Simulationwith logarithmiccomplexity

Parekh, GallagerGeneralized Processor SharingLazar, Hyman, PacificiSchedulable Region

GolestaniSelf Clocked Fair Queing

Shreedhar, Varghese Deficit Round Robin

1994

Page 69: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 69 • Communications TechnologyCommunications TechnologyLabLab

Self-Clocked Fair Queuing (SCFQ)Self-Clocked Fair Queuing (SCFQ)

round number of the simulated

GPS service

finish time of the packet currently

in service =

Same as WFQ apart from:

Main disadvantage: large end-to-end delay

Page 70: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 70 • Communications TechnologyCommunications TechnologyLabLab

Start Time Fair Queuing (SFQ)Start Time Fair Queuing (SFQ)

round number of the simulated

GPS service

start time of the packet currently

in service =

Transmission order: Ascending order of start timesSame end-to-end delay as WFQ

Same as WFQ apart from:

Page 71: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 71 • Communications TechnologyCommunications TechnologyLabLab

W2FQW2FQ

transmission order

select the packet with the minimum tag from among those that have alreadystarted service in the corresponding

GPS simulation

=

Smaller Absolute Fairness Bound

Same as WFQ apart from:

Page 72: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 72 • Communications TechnologyCommunications TechnologyLabLab

Round Robin SchedulingRound Robin Scheduling

Deficit Round RobinDeficit Round Robin Each connection has a deficit counterEach connection has a deficit counter

Every round the deficit counter is incremented Every round the deficit counter is incremented by a quantumby a quantum

If the packet size < counter then the packet is If the packet size < counter then the packet is transmitted and the counter is reduced by the transmitted and the counter is reduced by the packet sizepacket size

Page 73: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 73 • Communications TechnologyCommunications TechnologyLabLab

Exact GPS Simulation with Logarithmic Complexity Exact GPS Simulation with Logarithmic Complexity

L-GPS simulates GPS with minimum deviation of L-GPS simulates GPS with minimum deviation of one packet size at one packet size at OO(log(logNN) complexity) complexity All other well known schedulers that accomplish the All other well known schedulers that accomplish the

same deviation (e.g., Wsame deviation (e.g., W22FQ) have O(FQ) have O(NN) complexity) complexity

Key idea:Key idea: L-GPS pre-computes the evolution of the round number L-GPS pre-computes the evolution of the round number

function of the simulated GPS service using a tree function of the simulated GPS service using a tree structurestructure

Page 74: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 74 • Communications TechnologyCommunications TechnologyLabLab

Some Sorting Data StructuresSome Sorting Data Structures

HeapsHeaps

Binomial heapsBinomial heaps

Calendar QueuesCalendar Queues

Van Emde Boas TreesVan Emde Boas Trees

Trees of ComparatorsTrees of Comparators

CMU SortingCMU Sorting

AT&T SortingAT&T Sorting

Polytechnic Institute SortingPolytechnic Institute Sorting

Page 75: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 75 • Communications TechnologyCommunications TechnologyLabLab

The Tree of ComparatorsThe Tree of ComparatorsYou divide packets into groups

You send each group to a stage of comparators

You obtain a minimum from each comparator

You pass the minima into a second stage of comparators

You repeat the process until one packet remains

second stage of

comparators

final stage of

comparators

packet with minimum tag

group of head-of-line

packets

line group of

head-of-packets

packet in group with

minimum tag

first stage of

comparators

comparator

Page 76: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 76 • Communications TechnologyCommunications TechnologyLabLab

The Sorting Scheme from AT&TThe Sorting Scheme from AT&T

FIFO 1

FIFO 2

FIFO k

Bin 1

Bin 2

Bin m

Connection FIFOs Sorting Bins

scheduling horizon

Page 77: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 77 • Communications TechnologyCommunications TechnologyLabLab

The Sorting Scheme from Polytechnic Inst. Brooklyn The Sorting Scheme from Polytechnic Inst. Brooklyn

Sorting is supported by a hierarchy of bit vectors

Range of time stamp values

Page 78: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

®

PART III: Sorting Packets by Packet Schedulers Using the Connected Trie Data Structure

PART III: Sorting Packets by Packet Schedulers Using the Connected Trie Data Structure

Page 79: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 79 • Communications TechnologyCommunications TechnologyLabLab

ContributionContribution We propose a sorting algorithm and data structure that We propose a sorting algorithm and data structure that

reduces the latency of making scheduling decisions to a reduces the latency of making scheduling decisions to a singe memory access timesinge memory access time Solution is applicable to SCFQ, SFQSolution is applicable to SCFQ, SFQ

Key ObservationKey Observation Increments on packet time stamps are between the range: Increments on packet time stamps are between the range:

(maximum packet size)/(minimum weight)(maximum packet size)/(minimum weight) This is called the scheduling horizonThis is called the scheduling horizon

ApproachApproach We represent the scheduling horizon as a trieWe represent the scheduling horizon as a trie We put state into the nodes of the trie to allow the leaves to We put state into the nodes of the trie to allow the leaves to

be connected into linked list be connected into linked list

Page 80: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 80 • Communications TechnologyCommunications TechnologyLabLab

Trie-based OrderingTrie-based Ordering

trie structure of heighth = log(scheduling horizon/region width)

0

scheduling horizon =

maximum packet size

minimumweight

10

0 1 1

region of time stampsfor packet A

trie traversal for packet A

region of time stampsfor packet B

trie traversalfor packet B

Page 81: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 81 • Communications TechnologyCommunications TechnologyLabLab

Van Emde Boas Trees and the Connecting TrieVan Emde Boas Trees and the Connecting Trie

linear traversal binary traversal(Van Emde Boas Tree)

optimal traversallinear traversal optimal traversal(Connected Trie)

Page 82: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 82 • Communications TechnologyCommunications TechnologyLabLab

Main ConceptsMain Concepts Each node stores: Each node stores:

a pointer to the rightmost leaf a pointer to the rightmost leaf

– with the highest value from among those found by traversing the left child of the node

a pointer to the leftmost leaf a pointer to the leftmost leaf

– with the lowest value from among those found by traversing the right child of the node

When a new packet is added into the trieWhen a new packet is added into the trie

– The algorithm ‘discovers’ the rightmost and leftmost leaves the new packet should be connected to

– The new packet is inserted into a linked list of leaves

Hence the next packet for transmission into the network is Hence the next packet for transmission into the network is found in a single memory access timefound in a single memory access time

Page 83: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 83 • Communications TechnologyCommunications TechnologyLabLab

Example: Root onlyExample: Root only

R (-∞, +∞)

Page 84: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 84 • Communications TechnologyCommunications TechnologyLabLab

Example: Adding 13Example: Adding 13

R (-∞, 13)

NULL

C

B

A

13

(13, +∞)

(-∞, 13)

(-∞, 13)

1

1

1

0

NULL

Page 85: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 85 • Communications TechnologyCommunications TechnologyLabLab

Example: Adding 5Example: Adding 5

C

B

D

E

F

5

R

A

13

(5, 13)

(13, +∞)

(-∞, 13)

(-∞, 13)

(-∞, 5)

(5, 13)

(-∞, 5)

NULL NULL

0

0

1

1

1

1

1

0

Page 86: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 86 • Communications TechnologyCommunications TechnologyLabLab

Example: Adding 10Example: Adding 10

C

B

10

H

D

E

F

5

R

A

G

13

(5, 10)

(5, 10) (13, +∞)

(-∞, 13)

(10, 13)

(10, 13)(-∞, 5)

(5, 13)

(-∞, 5)

NULL NULL

0

0

1

1

1

1

1

1

0

0

0

Page 87: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 87 • Communications TechnologyCommunications TechnologyLabLab

Example: Adding 8Example: Adding 8

C

B

10

H

8

D

E

F

5

R

A

I

G

13

(5, 8)

(8, 10) (13, +∞)

(-∞, 13)

(10, 13)

(10, 13)(8, 10)(-∞, 5)

(5, 13)

(-∞, 5)

NULL NULL

0

0

1

1

1

1

1

1

0

0

0 0

0

Page 88: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 88 • Communications TechnologyCommunications TechnologyLabLab

The Node Traversal AlgorithmThe Node Traversal Algorithm

visit a node

update the information about the leftmost and rightmost leaves where

the new element should be connected to

if the new element is greater (lower) than the rightmost (leftmost) leaf of the node, update the state

on the node

visit the next node

Page 89: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 89 • Communications TechnologyCommunications TechnologyLabLab

Using Two Tries at a TimeUsing Two Tries at a Time Why two tries at a time:Why two tries at a time:

Let’s assume Let’s assume D D is the scheduling horizonis the scheduling horizon

During the transmission of a packet, new packets will be During the transmission of a packet, new packets will be associated with time stamp increments at most associated with time stamp increments at most DD..

During the transmission of these packets, new packets will be During the transmission of these packets, new packets will be associated with time stamp increments at most 2associated with time stamp increments at most 2DD..

Trie 1 Trie 2 Trie 3

once all packets in Trie 1are transmitted we reuse the memory

and start building Trie 3 a.s.o

Trie 4

Page 90: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 90 • Communications TechnologyCommunications TechnologyLabLab

Optimal Height of the Connected TrieOptimal Height of the Connected Trie

optimal height

= log(

least common multiple of weights

greatest commondivisor of packet

sizes

maximum packet size

minimum weight

X )

Page 91: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 91 • Communications TechnologyCommunications TechnologyLabLab

PerformancePerformance Scheduling decision timeScheduling decision time

Exactly 1 memory access independent of the number of Exactly 1 memory access independent of the number of flows in the schedulerflows in the scheduler

Insertion timeInsertion time 6 read accesses, 7 write accesses for a trie of height 12.6 read accesses, 7 write accesses for a trie of height 12.

Memory access bandwidthMemory access bandwidth 10 words per read access, 6 words per write access for a 10 words per read access, 6 words per write access for a

trie of height 12trie of height 12

Memory requirementMemory requirement 34KB for 256 connections, 213 KB for 64K connections, 34KB for 256 connections, 213 KB for 64K connections,

for a trie of height 12 for a trie of height 12

Page 92: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

®

PART IV: Building a Four Level, OC-48, Programmable Hierarchical Packet Scheduler

PART IV: Building a Four Level, OC-48, Programmable Hierarchical Packet Scheduler

Page 93: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 93 • Communications TechnologyCommunications TechnologyLabLab

ContributionContribution Efficient implementation of hierarchical scheduling on Efficient implementation of hierarchical scheduling on

the IXP2xxx series and the next generation the IXP2xxx series and the next generation processorsprocessors Support for OC48 on IXP28xxSupport for OC48 on IXP28xx

Budget of 228 cycles/packet for scheduling only!Budget of 228 cycles/packet for scheduling only!

Support for multiple levels of hierarchy upto 5 levelsSupport for multiple levels of hierarchy upto 5 levels

Support for a total of <= 256K input queues with Support for a total of <= 256K input queues with arbitrary weights at each levelarbitrary weights at each level

Any possible configuration of the hierarchies should be Any possible configuration of the hierarchies should be supportedsupported

Page 94: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 94 • Communications TechnologyCommunications TechnologyLabLab

Examples of SchedulersExamples of Schedulers

Egress Scheduler Configuration

SCFQ

SCFQ

… SCFQ

Fabric

Level 1

Level 00

3

SCFQ

SCFQ

SCFQ

SCFQ

0

256

256

0

Level 2

0

256

pipes

0

256

pipes

262144 pipestotal

SCFQ

SCFQ

… SCFQ

Fabric

Level 1

Level 00

3

SCFQ

SCFQ

SCFQ

SCFQ

0

256

256

0

Level 2

0

256

pipes

0

256

pipes

total

0

15

… WRR

0

15

… WRR

VOQs

VOQsRR

Fabric Port 0

Fabric Port 255

Fabric 0

4095

… WRR

Fabric

VOQs

Ingress Scheduler Configurations

0

7

… SCFQ

0

7

… SCFQ

pipes

pipesSCFQ

Level 5

Level 1 0

262143

…SCFQ

Fabric

pipes

Egress Scheduler Configurations

32K pipestotal

0

15

… WRR

0

15

… WRR

VOQs

VOQsRR

Fabric Port 0

Fabric Port 255

Fabric 0

4095

… WRR

Fabric

VOQs

0

7

… SCFQ

0

7

… SCFQ

pipes

pipesSCFQ

Level 5

0

262143

…SCFQ

Fabric

pipes…

total

Page 95: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 95 • Communications TechnologyCommunications TechnologyLabLab

ApproachApproach Supporting many single level schedulers (e.g., Supporting many single level schedulers (e.g.,

64K) using a limited number of microengines and 64K) using a limited number of microengines and threadsthreads

We assign a small number of threads to serve each entire We assign a small number of threads to serve each entire level of the hierarchy as opposed to a single scheduler level of the hierarchy as opposed to a single scheduler only! only!

packets are exchanged between levels at line ratepackets are exchanged between levels at line rate

Bandwidth sharing takes place at different levels in Bandwidth sharing takes place at different levels in parallel!parallel!

Sorting of PacketsSorting of Packets We use the connected trie and tree of comparators We use the connected trie and tree of comparators

structures structures

Page 96: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 96 • Communications TechnologyCommunications TechnologyLabLab

Parallelizing the Hierarchical SchedulerParallelizing the Hierarchical Scheduler We assign a small number of threads to serve each We assign a small number of threads to serve each

level of the hierarchylevel of the hierarchy We can do that because packets are exchanged between We can do that because packets are exchanged between

levels at line ratelevels at line rate

We insert pre-sorted packets at each level We insert pre-sorted packets at each level We can do that because hierarchical schedulers consist of We can do that because hierarchical schedulers consist of

independent single level schedulers independent single level schedulers

We parallelize the dequeue processing at each of the We parallelize the dequeue processing at each of the levelslevels

A dequeue thread creates a hole (i.e., empty packet space)A dequeue thread creates a hole (i.e., empty packet space)

A hole filling thread at the next level fills the hole by A hole filling thread at the next level fills the hole by inserting a new packetinserting a new packet

Page 97: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 97 • Communications TechnologyCommunications TechnologyLabLab

High Level Design from First Order PrinciplesHigh Level Design from First Order Principles

Fact, Assumption or Design Principle

Consequence High-Level Design Guideline

a hierarchical scheduler consists of single-level schedulers

we can insert presorted packets at each level

we address the sorting problem locally at each single level scheduler

packets need to be exchanged between levels at line rate

we can assign a small number of threads to serve each entire level of the

hierarchy

the levels of the hierarchy can operate in parallel independent of

each other

each packet transmitted at a level creates an empty packet space, which

we call a ‘hole’

to fill a hole you may need to perform at least one SRAM access which can be as

large as 300 compute cycles

we need to buffer more than one packet at each level of the scheduler

the enqueuing process may insert packets into single-level schedulers

at any level

enqueuing and hole-filling threads may need to access the same state

information concurrently

mutual exclusion techniques are required

calculating a virtual time function is complex

virtual time can be approximated by the finish time of the packet currently in

service (SCFQ)

a packet entering a scheduler does not need to preempt the packet

currently in service

Page 98: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 98 • Communications TechnologyCommunications TechnologyLabLab

Parallelized Hierarchical Scheduler IllustratedParallelized Hierarchical Scheduler Illustrated

line rate

line rateline rate

line rate

level 1

threads

level 1

threadsthreads

level

threads

level 2

threadsthreads

leaf level

threads

leaf level

threadsthreads

root level

threadsthreadsthreads

SCFQ

SCFQ

SCFQ

SCFQ

SCFQ

SCFQ

SCFQ

SCFQ

SCFQ

SCFQ

SCFQ

SCFQ

SCFQ

single levelschedulers

are represented by state in

SRAM memory

communicationacross levels is donethrough local memory

or next neighbor registers

Page 99: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 99 • Communications TechnologyCommunications TechnologyLabLab

Meeting the OC-48 Line Rate with BufferingMeeting the OC-48 Line Rate with Buffering

Final Output

Why buffering?Why buffering? To cope for the fact that a single SRAM access may take more time 228 cyclesTo cope for the fact that a single SRAM access may take more time 228 cycles

Enqueue ProcessEnqueue Process We maintain We maintain two minimum tagtwo minimum tag packets at the output of each scheduler packets at the output of each scheduler

Dequeue ProcessDequeue Process While a hole is filled, the next packet While a hole is filled, the next packet is ready to be servicedis ready to be serviced

SCFQ

SCFQ

SCFQ

Level 0

Level 1Head of line packets

Input queues– usehardware queues

A hole filling threadstarts inserting a new packet

in the root scheduler

a packet is being transmitted

a second packetbuffered at the root is

transmitted

Page 100: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 100 • Communications TechnologyCommunications TechnologyLabLab

Hierarchical Scheduler PrototypedHierarchical Scheduler Prototyped

Egress Scheduler Configuration

SCFQ

SCFQ

… SCFQ

Fabric

Level 1

Level 00

3

SCFQ

SCFQ

SCFQ

SCFQ

0

256

256

0

Level 2

0

256

pipes

0

256

pipes

262144 pipestotal

SCFQ

SCFQ

… SCFQ

Fabric

Level 1

Level 00

3

SCFQ

SCFQ

SCFQ

SCFQ

0

256

256

0

Level 2

0

256

pipes

0

256

pipes

total

0

15

… WRR

0

15

… WRR

VOQs

VOQsRR

Fabric Port 0

Fabric Port 255

Fabric 0

4095

… WRR

Fabric

VOQs

Ingress Scheduler Configurations

0

7

… SCFQ

0

7

… SCFQ

pipes

pipesSCFQ

Level 5

Level 1 0

262143

…SCFQ

Fabric

pipes

Egress Scheduler Configurations

32K pipestotal

0

15

… WRR

0

15

… WRR

VOQs

VOQsRR

Fabric Port 0

Fabric Port 255

Fabric 0

4095

… WRR

Fabric

VOQs

0

7

… SCFQ

0

7

… SCFQ

pipes

pipesSCFQ

Level 5

0

262143

…SCFQ

Fabric

pipes…

total

0

7

… SCFQ

0

7

… SCFQ

pipes

pipesSCFQ

Level 5

Level 1

32K connectionstotal

Page 101: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 101 • Communications TechnologyCommunications TechnologyLabLab

Use of the IXP MicroenginesUse of the IXP Microengines

HF2BHF3BQMIB

…………

level 1level 2level 3level 4 level 0(root)

HF0HF1AHF2AHF3AQMIA

DT

QueueManager

HF1B

hole-fillingthreads

EN0A

EN0B

threadsserving levels

0 and 1

EN1A

EN1B

threadsserving levels

2 and 3

EN2A

EN2B

threadsserving level

4

enqueuingthreads

ME0ME1

ME2

Page 102: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 102 • Communications TechnologyCommunications TechnologyLabLab

RemarksRemarks Four level OC-48 line rate forwarding (2.5 Gbps)Four level OC-48 line rate forwarding (2.5 Gbps)

Data structures fit into the local memory and Data structures fit into the local memory and SRAM of IXP2400SRAM of IXP2400

We keep the memory access bandwidth We keep the memory access bandwidth consumption at reasonable level consumption at reasonable level We fetch either the heads or tails of queues but not both We fetch either the heads or tails of queues but not both

at the same timeat the same time

We employ novel inter-thread synchronization We employ novel inter-thread synchronization algorithmsalgorithms

Page 103: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

®

Tutorial Summary and ConclusionTutorial Summary and Conclusion

Page 104: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 104 • Communications TechnologyCommunications TechnologyLabLab

Tutorial Summary (I)Tutorial Summary (I) Packet schedulingPacket scheduling

Critical component of router datapathsCritical component of router datapaths

Generalized Processor SharingGeneralized Processor Sharing Ideal service (non-implementable)Ideal service (non-implementable)

Real implementationsReal implementations Annotate packets with time stampsAnnotate packets with time stamps

Sort packets according to their time stamp Sort packets according to their time stamp valuesvalues

Page 105: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 105 • Communications TechnologyCommunications TechnologyLabLab

Tutorial Summary (II)Tutorial Summary (II) We propose a new algorithm for sorting We propose a new algorithm for sorting

packetspackets Reduces the latency of making scheduling Reduces the latency of making scheduling

decisions to a single memory access timedecisions to a single memory access time

Parallelized Processor Architectures like Parallelized Processor Architectures like IXP2xxx are suitable for implementing IXP2xxx are suitable for implementing packet scheduling in softwarepacket scheduling in software

Page 106: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 106 • Communications TechnologyCommunications TechnologyLabLab

Future WorkFuture Work

Use the Connected Trie for implementing Use the Connected Trie for implementing disciplines other than SFQ, SCFQdisciplines other than SFQ, SCFQ Example: L-GPS?Example: L-GPS?

Use the Connected Trie in multi-level Use the Connected Trie in multi-level hierarchical scheduler configurations hierarchical scheduler configurations

Page 107: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 107 • Communications TechnologyCommunications TechnologyLabLab

Thanks for ListeningThanks for Listening

Page 108: ® Line Rate Packet Classification and Scheduling Michael Kounavis (Intel) Alok Kumar (Intel) Raj Yavatkar (Intel) Harrick Vin (U. Texas, Austin) October

• 108 • Communications TechnologyCommunications TechnologyLabLab

ReferencesReferences Michael E. Kounavis, Alok Kumar, Raj Yavatkar and Harrick Vin, Michael E. Kounavis, Alok Kumar, Raj Yavatkar and Harrick Vin,

““Two Stage Packet Classification Using Most Specific Filter Matching Two Stage Packet Classification Using Most Specific Filter Matching and Transport Level Sharingand Transport Level Sharing”, ”,

Technical ReportTechnical Report, Communications Technology Lab, Intel Corporation, , Communications Technology Lab, Intel Corporation, In Submission to Computer NetworksIn Submission to Computer Networks

Michael E. Kounavis, Alok Kumar, and Raj Yavatkar, Michael E. Kounavis, Alok Kumar, and Raj Yavatkar,

““Sorting Packets by Packet Schedulers Using the Connected Trie Data Sorting Packets by Packet Schedulers Using the Connected Trie Data StructureStructure”, ”,

Technical ReportTechnical Report, Communications Technology Lab, Intel Corporation, , Communications Technology Lab, Intel Corporation, In Submission to Software Practice and ExperienceIn Submission to Software Practice and Experience

Michael E. Kounavis, Alok Kumar, and Raj Yavatkar, Michael E. Kounavis, Alok Kumar, and Raj Yavatkar,

““A Four Level OC-48 Programmable Hierarchical Packet SchedulerA Four Level OC-48 Programmable Hierarchical Packet Scheduler”, ”,

Technical ReportTechnical Report, Communications Technology Lab, Intel Corporation, Communications Technology Lab, Intel Corporation