Framework For Supporting Multi-Service Edge Packet Processing On Network Processors
Arun Raghunath, Aaron Kunze, Erik J. Johnson (Intel Research and Development)
Vinod Balakrishnan (Openwave Systems Inc.)
ANCS 2005
Outline: Overview | Monitoring | Policy | Resource allocation | Mechanisms | Results | Conclusion
Problem
Edge routers need to support a sophisticated set of services.
How to best use the numerous hardware resources that network processors provide?
Cores, multiple memory levels, inter-core queuing, crypto assists
Workloads fluctuate over time.
[Figure: processing requirement per interval (0-600) vs. timeslice (0-250), for the traffic mixes ALL, COMPRESS, NO COMPRESS, DECOMPRESS, and NO DECOMPRESS.]
Source: "A Case for Run-time Adaptation in Packet Processing Systems", R. Kokku et al., HotNets-II, vol. 34, issue 1, January 2004
[Figure: HTTP request volume per timeslice over the trace (series: http_data, avg).]
Trace: http://ita.ee.lbl.gov/html/contrib/UCB.home-IP-HTTP.html
Location: network edge in front of a group of Internet clients; duration: 5 days
Problem: Workload variations
There is no representative workload!
Problem
Edge routers need to support large sets of sophisticated services.
How to best use the numerous hardware resources that network processors provide?
Cores, multiple memory levels, inter-core queuing, crypto assists
Workloads fluctuate over time, and there is no representative workload.
Systems are usually over-provisioned to handle the worst case.
Run-time adaptation: the ability to change the mapping of services to hardware resources.
Adaptation Opportunities
[Diagrams: Intel IXP2400-style layouts with sixteen microengines (MEv2 1-16) and an Intel XScale® core, running IPv6 and IPv4 compression-and-forwarding services.]
Ex. 1: Change the allocation of microengines between services to increase an individual service's performance.
Ex. 2: Support a large set of services in the "fast path" according to use (e.g., adding VPN encrypt/decrypt).
Ex. 3: Power down unneeded processors.
Theory of Operation
[Diagram: the run-time system holds executable binaries for services A, B, and C (XScale and ME versions). Driven by the traffic mix, a System Monitor collects queue info; the Linker computes a resource mapping; the run-time system checkpoints processors and binds resources, remapping services across the sixteen MEv2 microengines and the Intel XScale® core through the Resource Abstraction Layer (RAL).]
Rate-based Monitoring
Observe the queue between two stages (arrival rate R_arr, departure rate R_dep).
Arrival/departure rates are indicative of processing needs.
Assumption: R_dep scales linearly, so for a stage running on n cores, R_dep = n * R_dep1.
Definitions: R_arr = current arrival rate; R_dep = current departure rate; R_worst = worst-case arrival rate; t_sw = time to switch on a core; Q_size = current queue depth.
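The monitoring described above can be sketched as follows; this is an illustrative model (class and method names are ours), assuming per-queue enqueue/dequeue counters are readable once per interval:

```python
# Hypothetical sketch of rate-based monitoring between two pipeline
# stages; names (QueueMonitor, sample) are illustrative, not from the paper.

class QueueMonitor:
    """Samples a queue's enqueue/dequeue counters at a fixed interval
    and derives arrival (R_arr) and departure (R_dep) rates."""

    def __init__(self, interval_s):
        self.interval_s = interval_s
        self.prev_enq = 0
        self.prev_deq = 0

    def sample(self, enq_count, deq_count):
        # Rates are deltas of the cumulative counters over the interval.
        r_arr = (enq_count - self.prev_enq) / self.interval_s
        r_dep = (deq_count - self.prev_deq) / self.interval_s
        self.prev_enq, self.prev_deq = enq_count, deq_count
        qsize = enq_count - deq_count   # current queue occupancy
        return r_arr, r_dep, qsize
```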
Allocation policy
Number of cores = R / R_dep1.
If R = R_worst, the system moves directly to the worst-case provisioned state.
Only request cores as needed: NumCores(R_arr) = R_arr / R_dep1.
If R_arr >> R_dep, request allocation of processors immediately; how many is a function of R_arr / R_dep1.
If R_arr is only slightly larger, let the queue grow to Q_adapt (the buffer space needed to absorb the worst burst), then request allocation of one processor.
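A minimal Python sketch of this policy; function names and the threshold are assumptions (the ">>" comparison is approximated here as an excess of more than one core's worth of departure rate):

```python
import math

def num_cores(r_arr, r_dep1):
    """Cores needed so that n * R_dep1 >= R_arr (linear-scaling assumption)."""
    return max(1, math.ceil(r_arr / r_dep1))

def allocation_request(r_arr, r_dep, r_dep1, qsize, q_adapt, cur_cores):
    """Return how many extra cores to request this monitoring interval.
    Illustrative sketch of the slide's policy, not the paper's code."""
    if r_arr > r_dep + r_dep1:                 # R_arr >> R_dep: jump straight
        return num_cores(r_arr, r_dep1) - cur_cores
    if r_arr > r_dep and qsize >= q_adapt:     # slight excess: wait for Q_adapt
        return 1
    return 0
```

With R_arr = R_worst, `num_cores` lands directly on the worst-case provisioned state, matching the slide.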
De-allocation policy
While increasing the allocation, latch R_dep1.
If R_arr / R_dep1 < current allocation, request de-allocation of one core.
Hysteresis: wait for some cycles before requesting de-allocation again.
This avoids fluctuations on transient dips in the arrival rate.
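The de-allocation rule with hysteresis might be sketched as below; the class name and the cycle-counting scheme are our assumptions, not the paper's code:

```python
import math

class DeallocationPolicy:
    """Request release of one core when R_arr / R_dep1 drops below the
    current allocation, then wait `hold_cycles` monitoring intervals
    before requesting again, so transient dips do not cause thrashing."""

    def __init__(self, hold_cycles):
        self.hold_cycles = hold_cycles
        self.cooldown = 0

    def should_release(self, r_arr, r_dep1, cur_cores):
        if self.cooldown > 0:          # inside the hysteresis window
            self.cooldown -= 1
            return False
        if cur_cores > 1 and math.ceil(r_arr / r_dep1) < cur_cores:
            self.cooldown = self.hold_cycles   # open a new hysteresis window
            return True
        return False
```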
Theory of Operation (continued)
[Diagram as before, now adding a Resource Allocator: the System Monitor's queue info triggers the Resource Allocator, which drives the Linker's resource mapping over the RAL.]
Resource allocator
Handles requests for allocation/de-allocation from individual stages.
Aware of global system state; decides:
which specific processor to allocate or free
which stage to de-allocate or migrate when no free processors are available (steal only when the victim's arrival rate is lower than the requesting stage's)
whether a request is declined
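One way to sketch the steal decision; the `stages` bookkeeping structure is hypothetical (the paper does not give this code), and we prefer the least-loaded eligible victim:

```python
def pick_victim(stages, requester):
    """Global allocator decision when no free processors remain: steal a
    core only from a stage whose arrival rate is lower than the
    requester's. Returns the victim stage name, or None to decline.
    `stages` maps name -> {'r_arr': arrival rate, 'cores': allocation}."""
    req_rate = stages[requester]['r_arr']
    candidates = [
        (info['r_arr'], name)
        for name, info in stages.items()
        if name != requester and info['cores'] > 1 and info['r_arr'] < req_rate
    ]
    if not candidates:
        return None          # request declined: nobody is less loaded
    return min(candidates)[1]  # steal from the least-loaded eligible stage
```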
System Evaluation
[Theory-of-operation diagram repeated: traffic mix in; System Monitor, Resource Allocator, and Linker produce the mapping of services A, B, and C onto the microengines and XScale core over the RAL.]
Experimental setup
Radisys, Inc. ENP-2611*
600 MHz Intel® IXP2400 processor
MontaVista Linux*
3 optical Gigabit Ethernet ports
IXIA* traffic generator for packet stimulus
* Third party brands/names are property of their respective owners
Adaptation Costs
Overhead due to function calls into the resource abstraction layer: 14% performance degradation when processing minimum-size packets at line rate.
Overall adaptation time = binding time + (checkpointing and loading time * number of cores).
Cumulative effect: ~100 ms, dominated by the cost of the binding mechanism.
Adaptation benefits: Testing Methodology
Need to measure the ability of the system to handle long-term workload variations.
Systems compared:
Static system (profile-driven compilation)
Adaptive system
Adaptation benefits: Testing Methodology
Layer 3 switching application: Rx → L2 classifier → {L3 forwarder, L2 bridge} → Ethernet encapsulation → Tx.
[Diagram: a profile compiler turns L2-bridge/L3-forwarder profiles into a static binary mapped across the sixteen MEv2 microengines and the Intel XScale® core; traffic is applied and system performance measured.]
Benefits of run-time adaptation: long-duration (60 s) bursts
[Figure: ratio of packets-received rate to packets-sent rate (0-100%) vs. input traffic mix (L2:L3 from 0:100 to 100:0; absolute rate = 2.5 Gbps), comparing the adaptive system against the static binaries l2_l3_0_100 and l2_l3_60_40.]
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
Source: Intel
Future work
Study the ability of an adaptive system to handle short-term fluctuations: would it drop more packets than a non-adaptive system?
Enable flow-aware run-time adaptation.
Explore more sophisticated resource allocation algorithms that support properties like fairness and performance guarantees.
Related work
Ease of programming: NP-Click (N. Shah et al., NP-2 Workshop, 2003); Nova (L. George and M. Blume, ACM SIGPLAN 2003); Auto-Partitioning programming model (Intel whitepaper, 2003)
Dynamic extensibility: Router plugins (D. Decasper et al., SIGCOMM 1998); PromethOS (R. Keller et al., IWAN 2002); VERA (S. Karlin and L. Peterson, Computer Networks, 2002); NetBind (M. Kounavis et al., Software: Practice and Experience, 2004)
Load balancing: ShaRE (R. Kokku, Ph.D. thesis, UT Austin, 2005)
Conclusion
Run-time adaptation is an attractive approach for handling traffic fluctuations.
We implemented a framework capable of adapting the processing cores allocated to network services, and a policy that:
automatically balances the service pipeline
overcomes the code-store limitation of fixed-control-store processor cores
Background
Checkpointing: Leveraging domain characteristics
Finding the best checkpoint is easier in packet processing than in general domains.
Characteristics of data-flow applications:
Typically implemented as a dispatch loop
The dispatch loop executes at high frequency
The top of the dispatch loop has no stack information
Since the compiler creates the dispatch loop, the compiler inserts the checkpoints in the code.
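An illustrative model of why the top of the dispatch loop is a natural checkpoint: no live stack state exists there, so a stage can be stopped and migrated between packets. Plain Python stands in for microengine code, and every name here is ours:

```python
# Sketch (not the paper's code) of a data-flow stage's dispatch loop
# with a compiler-inserted checkpoint test at the top of the loop.

def run_stage(get_packet, process, checkpoint_requested, save_state):
    while True:
        # --- compiler-inserted checkpoint: top of the dispatch loop ---
        if checkpoint_requested():
            save_state()          # only per-stage globals, no stack frames
            return                # core can now be reassigned
        pkt = get_packet()        # fetch the next packet
        process(pkt)              # per-packet work completes before next check
```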
Why Have Binding?
We want to be able to use the fastest implementations of resources available.
[Diagram: two microengine layouts for communicating stages A and B. Once adaptation places the stages on neighboring microengines, the fast implementations become usable: next-neighbor (NN) rings and local locks.]
Binding
Goal: use the fastest implementations of resources available.
Resource abstraction: programmers write to abstract resources (packet channels, uniform memory, locks, etc.).
Binding must have little impact on run-time performance.
Our approach: adaptation-time linking.
Resource binding approach: Adaptation-time linking
A microengine-based example:
[Diagram: at run time, the run-time system (RTS) has the application .o file and the RAL .o file. The application's RAL calls are initially undefined; the RAL .o file carries implementations 0-6. The linker adjusts the jump targets using the import-variable mechanism to produce the final .o file. The process is repeated after each adaptation.]
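A toy model of adaptation-time linking, under our own assumptions about the mechanics (the real linker patches microengine jump targets, not a Python dict): abstract RAL symbols are rebound to concrete implementation addresses at each adaptation.

```python
# Hypothetical model of import-variable patching at adaptation time.
# Symbol names, the table shape, and addresses are all illustrative.

def link(app_calls, ral_impls, chosen):
    """app_calls: abstract RAL symbols the application code uses.
    ral_impls: {symbol: {impl_name: address}} from the RAL .o file.
    chosen: {symbol: impl_name} picked for the current resource mapping.
    Returns the patched jump table: symbol -> concrete address."""
    table = {}
    for sym in app_calls:
        impls = ral_impls[sym]
        table[sym] = impls[chosen[sym]]   # patch the import variable
    return table
```

Re-running `link` with a different `chosen` mapping models the "process repeated after each adaptation" step.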
Binding: The Value of Choosing the Right Resource
Implementation on the Intel® IXP2400 processor:

Implementation     | # S-push/S-pull bytes | % S-push/S-pull bandwidth
Next-neighbor      | 0                     | 0%
Scratchpad ring    | 4                     | 0.47%
SRAM ring w/ stats | 68                    | 7.9%
Problem domain
[Diagram: an access network sits between an enterprise LAN (computers, servers, telephones; Cisco AS5800, 8260, and Cisco 1720 equipment) and the MAN/WAN, hosting edge services:]
VPN gateway, firewall, intrusion detection
Forwarding, switching
XML & SSL acceleration, L4-L7 switching, application acceleration
Compression, monitoring (billing, QoS)
Determining Q_adapt and the monitoring interval
[Diagram: the queue between R_arr and R_dep, annotated with the buffer space to handle the worst burst with n cores, the buffer space to handle the worst burst with n+1 cores, the queue fill-up while a core comes online, and Q_adapt.]
We want to maximize Q_adapt.
Q_adapt is a function of the queue-monitoring interval; the theoretical maximum Q_adapt is reached when queue depth can be detected instantaneously.