Framework For Supporting Multi-Service Edge Packet Processing On Network Processors
Arun Raghunath, Aaron Kunze, Erik J. Johnson (Intel Research and Development)
Vinod Balakrishnan (Openwave Systems Inc.)
ANCS 2005
Outline: Overview | Monitoring | Policy | Resource allocation | Mechanisms | Results | Conclusion
Problem
Edge routers need to support a sophisticated set of services.
How to best use the numerous hardware resources that network processors provide?
Cores, multiple memory levels, inter-core queuing, crypto assists
Workloads fluctuate over time.
[Figure: processing requirement per interval (0-600) vs. timeslice (0-250), for the traffic mixes ALL, COMPRESS, NO COMPRESS, DECOMPRESS, and NO DECOMPRESS.]
Source: "A Case for Run-time Adaptation in Packet Processing Systems", R. Kokku et al., HotNets-II, vol. 34, issue 1, January 2004
[Figure: HTTP request volume per timeslice over the trace (series: http_data, avg).]
Trace: http://ita.ee.lbl.gov/html/contrib/UCB.home-IP-HTTP.html
Location: network edge in front of a group of Internet clients; duration: 5 days
Problem: Workload variations
There is no representative workload!
Problem
Edge routers need to support large sets of sophisticated services.
How to best use the numerous hardware resources that network processors provide?
Cores, multiple memory levels, inter-core queuing, crypto assists
Workloads fluctuate over time, and there is no representative workload.
Systems are usually over-provisioned to handle the worst case.
Run-time adaptation: the ability to change the mapping of services to hardware resources.
Adaptation Opportunities
[Diagrams: Intel IXP2400-style layouts with sixteen microengines (MEv2 1-16) and an Intel XScale® core, running IPv6 and IPv4 compression-and-forwarding services.]
Ex. 1: Change the allocation of microengines between services to increase an individual service's performance.
Ex. 2: Support a large set of services in the "fast path" according to use (e.g., adding VPN encrypt/decrypt).
Ex. 3: Power down unneeded processors.
Theory of Operation
[Diagram: the run-time system holds executable binaries for services A, B, and C (XScale and ME versions). Driven by the traffic mix, a System Monitor collects queue info; the Linker computes a resource mapping; the run-time system checkpoints processors and binds resources, remapping services across the sixteen MEv2 microengines and the Intel XScale® core through the Resource Abstraction Layer (RAL).]
Rate-based Monitoring
Observe the queue between two stages (arrival rate R_arr, departure rate R_dep).
Arrival/departure rates are indicative of processing needs.
Assumption: R_dep scales linearly, so for a stage running on n cores, R_dep = n * R_dep1.
Definitions: R_arr = current arrival rate; R_dep = current departure rate; R_worst = worst-case arrival rate; t_sw = time to switch on a core; Q_size = current queue depth.
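The monitoring described above can be sketched as follows; this is an illustrative model (class and method names are ours), assuming per-queue enqueue/dequeue counters are readable once per interval:

```python
# Hypothetical sketch of rate-based monitoring between two pipeline
# stages; names (QueueMonitor, sample) are illustrative, not from the paper.

class QueueMonitor:
    """Samples a queue's enqueue/dequeue counters at a fixed interval
    and derives arrival (R_arr) and departure (R_dep) rates."""

    def __init__(self, interval_s):
        self.interval_s = interval_s
        self.prev_enq = 0
        self.prev_deq = 0

    def sample(self, enq_count, deq_count):
        # Rates are deltas of the cumulative counters over the interval.
        r_arr = (enq_count - self.prev_enq) / self.interval_s
        r_dep = (deq_count - self.prev_deq) / self.interval_s
        self.prev_enq, self.prev_deq = enq_count, deq_count
        qsize = enq_count - deq_count   # current queue occupancy
        return r_arr, r_dep, qsize
```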
Allocation policy
Number of cores = R / R_dep1.
If R = R_worst, the system moves directly to the worst-case provisioned state.
Only request cores as needed: NumCores(R_arr) = R_arr / R_dep1.
If R_arr >> R_dep, request allocation of processors immediately; how many is a function of R_arr / R_dep1.
If R_arr is only slightly larger, let the queue grow to Q_adapt (the buffer space needed to absorb the worst burst), then request allocation of one processor.
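A minimal Python sketch of this policy; function names and the threshold are assumptions (the ">>" comparison is approximated here as an excess of more than one core's worth of departure rate):

```python
import math

def num_cores(r_arr, r_dep1):
    """Cores needed so that n * R_dep1 >= R_arr (linear-scaling assumption)."""
    return max(1, math.ceil(r_arr / r_dep1))

def allocation_request(r_arr, r_dep, r_dep1, qsize, q_adapt, cur_cores):
    """Return how many extra cores to request this monitoring interval.
    Illustrative sketch of the slide's policy, not the paper's code."""
    if r_arr > r_dep + r_dep1:                 # R_arr >> R_dep: jump straight
        return num_cores(r_arr, r_dep1) - cur_cores
    if r_arr > r_dep and qsize >= q_adapt:     # slight excess: wait for Q_adapt
        return 1
    return 0
```

With R_arr = R_worst, `num_cores` lands directly on the worst-case provisioned state, matching the slide.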
De-allocation policy
While increasing the allocation, latch R_dep1.
If R_arr / R_dep1 < current allocation, request de-allocation of one core.
Hysteresis: wait for some cycles before requesting de-allocation again.
This avoids fluctuations on transient dips in the arrival rate.
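The de-allocation rule with hysteresis might be sketched as below; the class name and the cycle-counting scheme are our assumptions, not the paper's code:

```python
import math

class DeallocationPolicy:
    """Request release of one core when R_arr / R_dep1 drops below the
    current allocation, then wait `hold_cycles` monitoring intervals
    before requesting again, so transient dips do not cause thrashing."""

    def __init__(self, hold_cycles):
        self.hold_cycles = hold_cycles
        self.cooldown = 0

    def should_release(self, r_arr, r_dep1, cur_cores):
        if self.cooldown > 0:          # inside the hysteresis window
            self.cooldown -= 1
            return False
        if cur_cores > 1 and math.ceil(r_arr / r_dep1) < cur_cores:
            self.cooldown = self.hold_cycles   # open a new hysteresis window
            return True
        return False
```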
Theory of Operation (continued)
[Diagram as before, now adding a Resource Allocator: the System Monitor's queue info triggers the Resource Allocator, which drives the Linker's resource mapping over the RAL.]
Resource allocator
Handles requests for allocation/de-allocation from individual stages.
Aware of global system state; decides:
which specific processor to allocate or free
which stage to de-allocate or migrate when no free processors are available (steal only when the victim's arrival rate is lower than the requesting stage's)
whether a request is declined
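One way to sketch the steal decision; the `stages` bookkeeping structure is hypothetical (the paper does not give this code), and we prefer the least-loaded eligible victim:

```python
def pick_victim(stages, requester):
    """Global allocator decision when no free processors remain: steal a
    core only from a stage whose arrival rate is lower than the
    requester's. Returns the victim stage name, or None to decline.
    `stages` maps name -> {'r_arr': arrival rate, 'cores': allocation}."""
    req_rate = stages[requester]['r_arr']
    candidates = [
        (info['r_arr'], name)
        for name, info in stages.items()
        if name != requester and info['cores'] > 1 and info['r_arr'] < req_rate
    ]
    if not candidates:
        return None          # request declined: nobody is less loaded
    return min(candidates)[1]  # steal from the least-loaded eligible stage
```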
System Evaluation
[Theory-of-operation diagram repeated: traffic mix in; System Monitor, Resource Allocator, and Linker produce the mapping of services A, B, and C onto the microengines and XScale core over the RAL.]
Experimental setup
Radisys, Inc. ENP-2611*
600 MHz Intel® IXP2400 processor
MontaVista Linux*
3 optical Gigabit Ethernet ports
IXIA* traffic generator for packet stimulus
* Third party brands/names are property of their respective owners
Adaptation Costs
Overhead due to function calls into the resource abstraction layer: 14% performance degradation when processing minimum-size packets at line rate.
Overall adaptation time = binding time + (checkpointing and loading time * number of cores).
Cumulative effect: ~100 ms, dominated by the cost of the binding mechanism.
Adaptation benefits: Testing Methodology
Need to measure the ability of the system to handle long-term workload variations.
Systems compared:
Static system (profile-driven compilation)
Adaptive system
Adaptation benefits: Testing Methodology
Layer 3 switching application: Rx → L2 classifier → {L3 forwarder, L2 bridge} → Ethernet encapsulation → Tx.
[Diagram: a profile compiler turns L2-bridge/L3-forwarder profiles into a static binary mapped across the sixteen MEv2 microengines and the Intel XScale® core; traffic is applied and system performance measured.]
Benefits of run-time adaptation: long-duration (60 s) bursts
[Figure: ratio of packets-received rate to packets-sent rate (0-100%) vs. input traffic mix (L2:L3 from 0:100 to 100:0; absolute rate = 2.5 Gbps), comparing the adaptive system against the static binaries l2_l3_0_100 and l2_l3_60_40.]
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
Source: Intel
Future work
Study the ability of an adaptive system to handle short-term fluctuations: would it drop more packets than a non-adaptive system?
Enable flow-aware run-time adaptation.
Explore more sophisticated resource allocation algorithms that support properties like fairness and performance guarantees.
Related work
Ease of programming: NP-Click (N. Shah et al., NP-2 Workshop, 2003); Nova (L. George and M. Blume, ACM SIGPLAN 2003); Auto-Partitioning programming model (Intel whitepaper, 2003)
Dynamic extensibility: Router plugins (D. Decasper et al., SIGCOMM 1998); PromethOS (R. Keller et al., IWAN 2002); VERA (S. Karlin and L. Peterson, Computer Networks, 2002); NetBind (M. Kounavis et al., Software: Practice and Experience, 2004)
Load balancing: ShaRE (R. Kokku, Ph.D. thesis, UT Austin, 2005)
Conclusion
Run-time adaptation is an attractive approach for handling traffic fluctuations.
We implemented a framework capable of adapting the processing cores allocated to network services, and a policy that:
automatically balances the service pipeline
overcomes the code-store limitation of fixed-control-store processor cores
Background
Checkpointing: Leveraging domain characteristics
Finding the best checkpoint is easier in packet processing than in general domains.
Characteristics of data-flow applications:
Typically implemented as a dispatch loop
The dispatch loop executes at high frequency
The top of the dispatch loop has no stack information
Since the compiler creates the dispatch loop, the compiler inserts the checkpoints in the code.
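An illustrative model of why the top of the dispatch loop is a natural checkpoint: no live stack state exists there, so a stage can be stopped and migrated between packets. Plain Python stands in for microengine code, and every name here is ours:

```python
# Sketch (not the paper's code) of a data-flow stage's dispatch loop
# with a compiler-inserted checkpoint test at the top of the loop.

def run_stage(get_packet, process, checkpoint_requested, save_state):
    while True:
        # --- compiler-inserted checkpoint: top of the dispatch loop ---
        if checkpoint_requested():
            save_state()          # only per-stage globals, no stack frames
            return                # core can now be reassigned
        pkt = get_packet()        # fetch the next packet
        process(pkt)              # per-packet work completes before next check
```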
Why Have Binding?
We want to be able to use the fastest implementations of resources available.
[Diagram: two microengine layouts for communicating stages A and B. Once adaptation places the stages on neighboring microengines, the fast implementations become usable: next-neighbor (NN) rings and local locks.]
Binding
Goal: use the fastest implementations of resources available.
Resource abstraction: programmers write to abstract resources (packet channels, uniform memory, locks, etc.).
Binding must have little impact on run-time performance.
Our approach: adaptation-time linking.
Resource binding approach: Adaptation-time linking
A microengine-based example:
[Diagram: at run time, the run-time system (RTS) has the application .o file and the RAL .o file. The application's RAL calls are initially undefined; the RAL .o file carries implementations 0-6. The linker adjusts the jump targets using the import-variable mechanism to produce the final .o file. The process is repeated after each adaptation.]
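A toy model of adaptation-time linking, under our own assumptions about the mechanics (the real linker patches microengine jump targets, not a Python dict): abstract RAL symbols are rebound to concrete implementation addresses at each adaptation.

```python
# Hypothetical model of import-variable patching at adaptation time.
# Symbol names, the table shape, and addresses are all illustrative.

def link(app_calls, ral_impls, chosen):
    """app_calls: abstract RAL symbols the application code uses.
    ral_impls: {symbol: {impl_name: address}} from the RAL .o file.
    chosen: {symbol: impl_name} picked for the current resource mapping.
    Returns the patched jump table: symbol -> concrete address."""
    table = {}
    for sym in app_calls:
        impls = ral_impls[sym]
        table[sym] = impls[chosen[sym]]   # patch the import variable
    return table
```

Re-running `link` with a different `chosen` mapping models the "process repeated after each adaptation" step.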
Binding: The Value of Choosing the Right Resource
Implementation on the Intel® IXP2400 processor:

Implementation     | # S-push/S-pull bytes | % S-push/S-pull bandwidth
Next-neighbor      | 0                     | 0%
Scratchpad ring    | 4                     | 0.47%
SRAM ring w/ stats | 68                    | 7.9%
Problem domain
[Diagram: an access network sits between an enterprise LAN (computers, servers, telephones; Cisco AS5800, 8260, and Cisco 1720 equipment) and the MAN/WAN, hosting edge services:]
VPN gateway, firewall, intrusion detection
Forwarding, switching
XML & SSL acceleration, L4-L7 switching, application acceleration
Compression, monitoring (billing, QoS)
Determining Q_adapt and the monitoring interval
[Diagram: the queue between R_arr and R_dep, annotated with the buffer space to handle the worst burst with n cores, the buffer space to handle the worst burst with n+1 cores, the queue fill-up while a core comes online, and Q_adapt.]
We want to maximize Q_adapt.
Q_adapt is a function of the queue-monitoring interval; the theoretical maximum Q_adapt is reached when queue depth can be detected instantaneously.