
BEBA — Behavioural Based Forwarding
Grant Agreement: 644122
BEBA/WP6 – D6.2 — Version: 1.0

BEBA

Behavioural Based Forwarding

Deliverable Report

D6.2 – Technology, performance and functional assessment

Deliverable title: Revision of the BEBA Data Plane Software Design, Implementation and Acceleration

Version: 1.0

Due date of deliverable (month): January 2016

Actual submission date of the deliverable (dd/mm/yyyy): 17/03/2017

Start date of project (dd/mm/yyyy): 01/01/2015

Duration of the project: 27 months

Work Package: WP6

Task: T6.1

Leader for this deliverable: NEC

Other contributing partners: 6WIND, CNIT, CESNET, KTH, TCS

Authors: R. Bifulco, F. Huici (NEC), V. Puš, L. Polčák, P. Benáček (CES), Q. Monnet (6WIND), G. Procissi, N. Bonelli, D. Adami, M. Bonola (CNIT)

Deliverable reviewer(s): R. Bifulco (NEC)

Deliverable abstract: This deliverable documents the performance results and the functional operations of stand-alone BEBA prototypes.

Keywords: High performance, prototype, use cases


Project co-funded by the European Commission within the Horizon 2020 (H2020) Programme

DISSEMINATION LEVEL
PU Public (X)
PP Restricted to other programme participants (including the Commission Services)
RE Restricted to a group specified by the consortium (including the Commission Services)
CO Confidential, only for members of the consortium (including the Commission Services)

REVISION HISTORY

Revision | Date | Author | Organisation | Description
0.1 | 31/01/2017 | R. Bifulco, F. Huici | NEC | Table of contents, ARP use cases, mSwitch impl.
0.2 | 15/02/2017 | V. Puš | CES | DDoS evaluation added
0.3 | 17/02/2017 | L. Polčák | CES | DDoS evaluation extended
0.4 | 20/02/2017 | Q. Monnet | 6WIND | eBPF evaluation
0.5 | 28/02/2017 | G. Procissi | CNIT | OFSoftSwitch performance
0.6 | 12/03/2017 | M. Bonola | CNIT | Additional use cases
1.0 | 17/03/2017 | R. Bifulco | NEC | Finalization and review

PROPRIETARY RIGHTS STATEMENT This document contains information, which is proprietary to the BEBA consortium. Neither this document nor the information contained herein shall be used, duplicated or communicated by any means to any third party, in whole or in parts, except with the prior written consent of the BEBA consortium. This restriction legend shall not be altered or obliterated on or from this document.

STATEMENT OF ORIGINALITY This deliverable contains original unpublished work except where clearly indicated otherwise. Acknowledgement of previously published material and of the work of others has been made through appropriate citation, quotation or both.


Table of Contents

1. Introduction
2. Functional assessment
   2.1. SPIDER with state synchronization
   2.2. MAC Learning
   2.3. Forwarding Consistency
   2.4. Load balancer
   2.5. Stateful firewall
   2.6. Dynamic NAT
   2.7. ARP handling
      2.7.1. Performance evaluation
   2.8. Token bucket rate-limiter
      2.8.1. Description
      2.8.2. Functional assessment — eBPF version
   2.9. Port knocking
      2.9.1. Description
      2.9.2. Functional assessment — eBPF version
   2.10. DDoS Detection and Mitigation
   2.11. AXE
      2.11.1. Detailed configuration
   2.12. HULA
      2.12.1. Detailed configuration
3. Performance assessment
   3.1. OFSoftSwitch
      3.1.1. Experimental setup
      3.1.2. OpenFlow performance
      3.1.3. BEBA stateless processing
      3.1.4. BEBA stateful processing
   3.2. eBPF
      3.2.1. BEBA with eBPF
      3.2.2. Parameters to test
      3.2.3. Topology
      3.2.4. Results
      3.2.5. Improvements and future tests
   3.3. mSwitch
4. Conclusions
5. Bibliography


1. Introduction

This deliverable presents a summary of the use cases implemented using the forwarding abstractions defined during the BEBA project. For each use case, we provide a brief description of the implementation details and of the environments used for the functional validation (in several cases, we provide pointers to extended descriptions included in other deliverables). When available, we also provide an additional performance assessment of the use case implementation. In general, we conclude that the BEBA abstraction is flexible enough to describe a large number of use cases, while still providing good performance levels both in software and hardware. The second part of this deliverable focuses specifically on a performance assessment of the BEBA switch software implementations. In particular, we report on the performance tests performed on three different implementations: OFSoftSwitch [3], eBPF [4], and mSwitch [5]. The three implementations have different properties in terms of scalability, ease of use, and deployability. We believe that the ability to efficiently implement the BEBA abstraction with these three solutions clearly demonstrates the portability of the abstraction.


2. Functional assessment

This section reports the various applications developed using the BEBA abstractions, the currently available implementations, and the environments used for performing the functional assessment.

Application: MAC learning
Description: With a BEBA switch, MAC learning becomes trivial, whereas "standard" OpenFlow would require an explicit interaction with the controller for each new flow.
Implementation/test environment: OFSoftSwitch implementation of a BEBA switch running in a controlled testbed of virtual machines. Proof of concept validated with the NetFPGA prototype.
Involved partners: CNIT

Application: Forwarding Consistency
Description: Traffic load balancing over multiple paths (also known as load sharing) is an important feature that allows flexible and efficient allocation of network resources.
Implementation/test environment: OFSoftSwitch implementation of a BEBA switch running in a controlled testbed of virtual machines. Proof of concept validated with the NetFPGA prototype.
Involved partners: CNIT

Application: SPIDER with state synchronization
Description: SPIDER is a fault-resilient SDN pipeline design that allows the implementation of failure recovery policies with fully programmable detection and reaction mechanisms in the switches.
Implementation/test environment: OFSoftSwitch implementation of a BEBA switch running in a controlled testbed of virtual machines.
Involved partners: CNIT/KTH

Application: Load balancer
Description: Implementation of a load balancer that distributes TCP connections to different TCP servers.
Implementation/test environment: OFSoftSwitch implementation of a BEBA switch running in a controlled testbed of virtual machines. Proof of concept validated with the NetFPGA prototype.
Involved partners: CNIT, NEC

Application: Stateful firewall
Description: Implementation of a stateful firewall using the BEBA abstractions.
Implementation/test environment: OFSoftSwitch implementation of a BEBA switch running in a controlled testbed of virtual machines. Proof of concept validated with the NetFPGA prototype.
Involved partners: CNIT, NEC

Application: Dynamic NAT
Description: Implementation of a dynamic NAT that assigns network ports to new incoming flows.
Implementation/test environment: OFSoftSwitch implementation of a BEBA switch running in a controlled testbed of virtual machines.
Involved partners: CNIT, NEC

Application: ARP handling
Description: Implementation of an ARP responder using the BEBA abstraction's in-switch packet generation capabilities.
Implementation/test environment: OFSoftSwitch implementation of a BEBA switch running in a controlled testbed of virtual machines.
Involved partners: NEC, Thales

Application: Token bucket
Description: A per-host token bucket used to limit traffic rate, relying on the per-flow registers and conditions of the extended BEBA interface.
Implementation/test environment: Linux eBPF
Involved partners: 6WIND

Application: Port knocking
Description: Hides a given L4 port of a server until the client correctly sends a secret packet sequence. Relies on the primary BEBA interface.
Implementation/test environment: Linux eBPF
Involved partners: 6WIND

Application: DDoS detection and mitigation
Description: Implementation of stateful protection against SYN-flood and other types of volumetric DDoS attacks.
Implementation/test environment: OFSoftSwitch implementation of a BEBA switch running in a controlled testbed of virtual machines, using real traffic. Results are correlated with those of independent monitoring systems.
Involved partners: CESNET, Thales

Application: AXE
Description: Distributed data-driven network topology discovery mechanism proposed in [1].
Implementation/test environment: OFSoftSwitch implementation of a BEBA switch running in a controlled testbed of virtual machines.
Involved partners: CNIT

Application: HULA
Description: Flowlet-based distributed load balancing mechanism proposed in [2].
Implementation/test environment: OFSoftSwitch implementation of a BEBA switch running in a controlled testbed of virtual machines.
Involved partners: CNIT


2.1. SPIDER with state synchronization
The SPIDER use case's detailed implementation has been reported in Deliverable 5.2, Section 3.1.

2.2. MAC Learning
The MAC learning use case's detailed implementation has been reported in Deliverable 2.1, Section 5.1.

2.3. Forwarding Consistency
The Forwarding consistency use case's detailed implementation has been reported in Deliverable 2.1, Section 5.2.

2.4. Load balancer
The load balancer use case's detailed implementation has been reported in Deliverable 5.3, Section 2.3.5.

2.5. Stateful firewall
The Stateful firewall use case's detailed implementation has been reported in Deliverable 5.3, Section 2.3.4.

2.6. Dynamic NAT
The Dynamic NAT use case's detailed implementation has been reported in Deliverable 5.3, Section 2.3.6.

2.7. ARP handling
As reported in D4.2, we implemented an ARP handling use case using an OFSoftSwitch implementation of the BEBA abstraction. For the controller side, we modified the RYU SDN framework to support the generation of PKTTMP_MOD messages and the specification of InSP [6] instructions. All tests were executed in a controlled emulated testbed running on a computer equipped with an Intel(R) Core(TM) i5-2540M CPU @ 2.60GHz (2 cores, 4 threads). Hyper-threading was disabled during the tests. The operating system and the load generator run on the first CPU core. Nping is used to generate traffic, sending ARP packets at different rates depending on the test. The controller and the switch share the second CPU core, and any communication between them happens over a local TCP channel. The controller's application installs a PTE at the switch, which contains an ARP reply template. Then, it installs an FTE that contains an InSP instruction, in order to trigger the packet generation upon reception of an ARP request. In all cases we compare the InSP implementation of the ARP responder with an analogous application implemented using standard OpenFlow. The OpenFlow ARP responder installs an FTE at the switch to generate a PACKET_IN upon reception of an ARP request. The controller answers the PACKET_IN with a PACKET_OUT that contains a statically defined ARP response, in order to minimize the processing time at the controller.

2.7.1. Performance evaluation
Performance tests have already been reported in D4.2; here we present a short summary of those results and extend the evaluation to a datacenter use case.

Reaction time. The reaction time is the measured time it takes to send an ARP request and receive the corresponding ARP reply. That is, we define the reaction time as the time difference between the reception of an ARP reply and the time at which the corresponding ARP request was sent. We configured Nping to generate a total of 100 ARP requests at a rate of 5 requests per second.
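For reference, the reaction-time metric defined above can be computed from per-request timestamps with a small helper (an illustrative sketch, not part of the original test harness; the timestamp lists are assumed inputs):

```python
def reaction_stats(sent_ms, received_ms):
    """Per-request reaction times and their average.

    sent_ms[i] / received_ms[i]: timestamps (in milliseconds) at which the
    i-th ARP request was sent and its reply was received.
    """
    # Reaction time = reply reception time minus request send time.
    times = [rx - tx for tx, rx in zip(sent_ms, received_ms)]
    return times, sum(times) / len(times)
```

For example, `reaction_stats([0.0, 200.0], [1.0, 212.0])` yields per-request times `[1.0, 12.0]` and an average of `6.5` ms.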

Figure 1: Time (in milliseconds) to generate an ARP reply when using traditional OpenFlow (i.e., involving the controller) and when using InSP.

As expected, the response time is much lower in the InSP case (cf. Figure 1), with an average reaction time of less than 1 ms, since the ARP reply is generated as soon as the ARP request is received at the switch. In the OpenFlow case, instead, generating a response requires a round trip with the controller (PACKET_IN + PACKET_OUT), which furthermore runs on top of the Python interpreter. Thus, the reaction time grows to 10-20 ms in most cases.

Evaluation in a datacenter. In this paragraph we evaluate the impact of using the BEBA switch for handling ARP in a network. To this end, we provide an analytical study of an InSP-based implementation of the ARP handling use case in a datacenter. Our analytical evaluation compares the number of control and data messages generated by the InSP-based solution against those generated by the OpenFlow reactive approaches presented in D5.3's Section 3.3.


Figure 2: Datacenter topology

For the purpose of this analysis, we assume a typical hierarchical datacenter network like the one shown in Figure 2. In such highly redundant topologies, protocols like ARP can produce a large amount of broadcast traffic. In our model, the network is managed by an OpenFlow controller and is composed of: (i) a single core switch (C); (ii) M aggregation switches, each connected to the core switch and to its neighboring aggregation switch(es); (iii) as many edge switches as aggregation ones, i.e., M, each connected to all the aggregation switches and to its neighboring edge switch(es). Finally, we assume that each edge switch is connected to hosts. The total number of switches (S), hosts (N) and links (E) in the considered network is shown in the following table, together with a summary of the aforementioned parameters.
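As an illustrative sketch (not taken from the deliverable's table), the counts implied by the topology description above can be computed as follows, assuming k hosts per edge switch (a value not stated here) and reading "neighboring" switches as a linear chain:

```python
def topology_counts(m, k):
    """Switch, host and link counts for the hierarchical datacenter model.

    m: number of aggregation switches (equal to the number of edge switches).
    k: assumed number of hosts attached to each edge switch.
    """
    switches = 1 + 2 * m          # S: one core + m aggregation + m edge
    hosts = m * k                 # N: k hosts per edge switch
    links = (m                    # core <-> each aggregation switch
             + (m - 1)            # neighboring aggregation switches (chain)
             + m * m              # each edge switch <-> every aggregation switch
             + (m - 1)            # neighboring edge switches (chain)
             + m * k)             # each edge switch <-> its hosts
    return switches, hosts, links
```

For instance, with m = 2 aggregation/edge switches and k = 4 hosts per edge switch, this reading gives S = 5 switches, N = 8 hosts and E = 16 links.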

To perform our analytical evaluation, we also introduce the parameter d (for "distance"), which represents the maximum number of switches in the shortest path between two hosts. Throughout our evaluation, we assign to each host a state value that can be either learnt (L) or not learnt (NL). For a learnt host, the controller knows the host's MAC/IP address mapping; for a not learnt host, the mapping is unknown. This classification helps define the behavior of the controller in the four possible communication scenarios for an ARP interaction between hosts: an ARP request can be sent by a host which is either learnt or not, to a host which is either learnt or not. Notice that in our model a host never changes its state from learnt to not learnt, while the change from not learnt to learnt can happen in different ways, depending on the considered ARP handling implementation. In all cases, whenever a host's state changes from not learnt to learnt, the controller installs an FTE in all switches to enable direct forwarding of unicast L2 frames to the learnt host. Also, we assume the controller enforces forwarding of the flows along the shortest path between the hosts.

Base scenarios and closed-form formulas. We consider three ARP handling implementation approaches: OpenFlow unicast (OF-unicast), OpenFlow responder (OF-responder) and InSP (cf. D5.3 Section 3.3). In OF-unicast, when a host (L or NL) sends an ARP request, the first switch on the path generates a PACKET_IN message from which, if the host was not learnt, the controller learns the host's MAC/IP address mapping. Then, if the destination host is learnt, the controller sends a PACKET_OUT back to the switch, transforming the ARP request from broadcast to unicast. Otherwise, the ARP request is flooded. In the former case, the ARP request travels through the network to the destination host, which generates the ARP reply and sends it back to the requesting host. In the latter case, the ARP request is flooded, and at each next switch a new interaction with the controller (PACKET_IN/PACKET_OUT) happens, followed by a new flooding. Notice that, in this case, we assume the controller enforces a spanning tree for broadcast packets, i.e., broadcast messages are counted once per link. In OF-responder, when a host (L or NL) sends an ARP request, the first switch on the path generates a PACKET_IN message from which, if the host was not learnt, the controller learns the host's MAC/IP address mapping. If the destination host is learnt, the controller generates an ARP reply and sends it back to the switch using a PACKET_OUT message. In turn, the switch forwards the ARP reply to the requesting host. When the destination host is not learnt, the controller sends a PACKET_OUT for each port not connected to a switch of each edge switch; that is, in our model the controller sends as many PACKET_OUTs as there are hosts. The switches in turn forward the ARP request contained in the received PACKET_OUT. When the destination host responds with an ARP reply, the receiving switch sends it to the controller using a PACKET_IN message. At this point, the state of the destination host becomes learnt, and the controller sends a PACKET_OUT to the requesting host's switch, in order to deliver the ARP reply. In the InSP case, all the interactions involving not learnt hosts work as in the OF-responder case. However, whenever a host is learnt, together with the installation of the related FTEs, the controller installs PTEs for generating ARP replies for that host in all the edge switches. Thus, after a host's state becomes learnt, any ARP reply for an incoming ARP request targeting that host is generated directly at the edge switch where the request is first received. In the following table we provide the formulas for calculating the number of messages associated with each scenario, for each category of traffic.


From the formulas, we can observe that OF-unicast uses more data plane messages, while OF-responder and InSP generate the same number of ARP messages. Furthermore, OF-responder and InSP generate the same number of control plane messages for the first three scenarios. However, once mappings are learnt (the "L-L" scenario), InSP no longer needs to generate any control plane messages, since ARP replies are generated by the switch. Indeed, this last case, where MAC/IP address mappings are known, is expected to occur more often than the cases involving not learnt hosts; that is, we assume that learning usually happens only when a host first connects to the network, and that the learnt information does not change for a reasonable period of time.

Figure 3: Datacenter evaluation results

General evaluation. To quantify the benefits that InSP may offer, we perform an analysis representative of an operational network. We analyze the number of messages generated by the ARP protocol during normal operation, i.e., by considering together the individual scenarios identified above. For this evaluation, we consider that each host in the studied topology wants to ping all the other hosts, assuming that no host is learnt at the beginning and that pings are generated sequentially (no simultaneous pings). Thus, when the first host pings the N-1 other hosts, the ARP request generated for the very first ping falls into the NL to NL case, and the first host becomes learnt. Then, the ARP requests generated for the N-2 following pings fall into the L to NL scenario. After the first N-1 pings, all hosts are learnt, and all subsequent ARP requests fall into the L to L scenario. The sum of the formulas identified above for the individual scenarios, weighted according to this sequence, gives closed-form formulas to compute the number of messages generated by this global operation for the three approaches. The generic formula is given by the following equation:

M_total = M(NL→NL) + (N−2) · M(L→NL) + (N−1)² · M(L→L)

where M(s) denotes the number of messages computed for scenario s according to the formulas above.
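The weighting of the ping sequence described above can be sketched as follows (a minimal illustration; the per-scenario message counts passed in are hypothetical placeholders for the closed forms given in the deliverable's table):

```python
def total_messages(n_hosts, m_nl_nl, m_l_nl, m_l_l):
    """Total messages for the all-pairs sequential ping experiment.

    n_hosts: number of hosts N in the topology.
    m_nl_nl, m_l_nl, m_l_l: messages generated per ARP interaction in the
    NL->NL, L->NL and L->L scenarios, respectively (assumed inputs).
    """
    first_ping = m_nl_nl                        # very first ping: both hosts not learnt
    rest_first_host = (n_hosts - 2) * m_l_nl    # remaining pings of the first host
    steady_state = (n_hosts - 1) ** 2 * m_l_l   # all later pings: both hosts learnt
    return first_ping + rest_first_host + steady_state
```

The three weights sum to 1 + (N−2) + (N−1)² = N(N−1), the total number of host pairs, which is a quick sanity check on the sequence described in the text.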

We computed the formula for each ARP handling approach and for different topology sizes, as the number of hosts in the network increases. The results are shown in Figure 3, where we plot (i) the number of ARP requests and replies (Figure 3 (a) to (c)), (ii) the number of control plane messages (Figure 3 (d) to (f)), and (iii) the resulting message reduction in the data and control planes offered by InSP versus OF-unicast and OF-responder. On the data plane side, we observe (Figure 3 (a) to (c)) that OF-unicast generates more messages than either of the two other approaches, regardless of the size of the topology and the number of hosts. This is due not only to broadcasting ARP requests when hosts are not learnt, but also to forwarding ARP requests all the way to the target host even when it is learnt. With OF-responder, however, the controller forwards the request to only a subset of switches when not learnt hosts must be discovered, and handles the resolution itself for learnt hosts. This way, only 2 ARP messages are generated for the resolution of known mappings. Here, InSP performs the same as OF-responder in terms of data messages; however, in the InSP case the ARP replies are generated by the switch that received the request, without involving the controller (thus also improving reaction times). The offloading of the control plane is visible in Figure 3 (d) to (f), where the number of packets generated with InSP is clearly lower than for the two other approaches, whatever the size of the topology and the number of hosts. It is interesting to observe that OF-responder generates more control plane messages than OF-unicast when the number of nodes is high in small to medium topologies. This happens because in OF-responder the number of PACKET_OUT messages is proportional to the number of hosts and switches, while in the OF-unicast case it is rather related to the size of the topology (number of links and switches).
We also note that, while InSP behaves like OF-responder during the learning phase, it produces far fewer messages overall because no controller interactions are required once mappings are learnt. A summary of the reduction in the number of messages obtained using InSP is shown in Figure 3 (g) to (i). In these figures, since InSP does not offer any gain in the data plane compared to OF-responder, we did not plot the corresponding line. When compared to OF-unicast, however, InSP saves from 63% to 91% of ARP messages for a small number of nodes in small to large topologies, respectively, and this gain converges to around 67% in all cases. In the control plane, compared to OF-unicast, InSP saves from 58% to 96% of the messages for a small number of nodes in small to large topologies, respectively. These savings decrease to about 50%, 55% and 66% as the number of nodes grows in small, medium and large topologies, respectively. In any case, the control message savings are always above 50%. When compared to OF-responder, InSP saves 30%, 45% and 49% of the messages for a small number of nodes in small, medium and large topologies, respectively. Each of these already significant gains increases with the number of nodes, up to a convergence point of 66%, whatever the size of the topology.

2.8. Token bucket rate-limiter

2.8.1. Description The details of the token bucket mechanism are available in deliverable D5.3. Hints about its eBPF implementation, as well as partial code chunks and a reference to the full source code, were made available in deliverable D3.4. Please refer to those documents for more details. As a short reminder, the token bucket relies on the BEBA extended interface. It uses a sliding window representing the capacity of the bucket, and classifies packets according to their arrival time relative to the position of this window; three cases may occur:

Figure 4: Based on packet arrival time, three distinct cases can be considered.

1. The packet hits within the window boundaries. It is forwarded, and the window is shifted one step to the “right” on the timeline.


2. The packet hits before (to the left of) the window: the window has already been shifted too many times, meaning that there were too many packets in the time interval. The packet is dropped (and the window is not shifted).

3. The packet hits after (to the right of) the window: the window has not been shifted recently. The window position is reinitialized at the packet arrival time, and the packet is forwarded.
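In pseudo-C, the three cases above can be sketched as follows. This is an illustrative model of the mechanism described in D5.3, not the actual eBPF code; the structure and field names are assumptions:

```c
#include <assert.h>
#include <stdint.h>

enum tb_verdict { TB_DROP = 0, TB_FORWARD = 1 };

/* Per-flow sliding window state (names are assumptions, for illustration). */
struct tb_state {
    uint64_t win_start; /* left edge of the window, in ns                */
    uint64_t win_len;   /* window length = bucket capacity * step, in ns */
    uint64_t step;      /* ns the window moves per forwarded packet      */
};

static enum tb_verdict tb_update(struct tb_state *s, uint64_t now)
{
    if (now < s->win_start)                 /* case 2: left of the window  */
        return TB_DROP;                     /*   too many recent packets   */
    if (now > s->win_start + s->win_len)    /* case 3: right of the window */
        s->win_start = now;                 /*   re-init at arrival time   */
    else                                    /* case 1: inside the window   */
        s->win_start += s->step;            /*   shift one step right      */
    return TB_FORWARD;
}
```

With a regeneration rate of 200,000 tokens per second, `step` would be 5,000 ns (10^9 / 200,000).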

2.8.2. Functional assessment — eBPF version To make sure the eBPF version of the token bucket was correctly implemented, both on Linux and on top of the Fast Path, 6WIND ran the application on both software stacks, with different packet sizes, and with a number of cores ranging from 1 to 4. The number of flows sent to the switch is equal to the number of cores processing the traffic, so that each core processes its own flow. We used a token bucket with a regeneration rate of 200,000 tokens per second, and obtained the following results:

Figure 5: Token Bucket functional assessment with eBPF

For each number of cores, independently of the packet size, the number of forwarded packets per second is roughly equal to 200,000 multiplied by the number of flows, proving that the per-host token bucket works. More precisely, from the values we observed:


• We note that Linux forwards exact multiples of 200,000 packets per second.

• Meanwhile, the Fast Path lets roughly 15% more packets pass. This is due to the way we obtain the arrival time of the packets: the helper we implemented first computes a number in seconds, and then translates it into nanoseconds. But instead of multiplying by 10^9, we use the closest power of 2, thus introducing a deviation. We could obtain something more precise, at the cost of lower performance; for now, we prefer to use the fastest algorithm.
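For illustration, the fast conversion can be sketched as below. These are hypothetical helpers, not the actual Fast Path code; the closest power of 2 to 10^9 is 2^30:

```c
#include <assert.h>
#include <stdint.h>

/* Fast but approximate: 2^30 "nanoseconds" per second instead of 10^9.
 * The shift replaces a multiplication, at the cost of a systematic
 * deviation, since 2^30 / 10^9 is roughly 1.074. */
static inline uint64_t sec_to_ns_fast(uint64_t sec)
{
    return sec << 30;
}

/* Exact but slower conversion. */
static inline uint64_t sec_to_ns_exact(uint64_t sec)
{
    return sec * 1000000000ULL;
}
```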

2.9. Port knocking

2.9.1. Description Port knocking is a basic security measure that consists in keeping a (TCP or UDP) port on a host closed until a specific series of packets has been received from the sender. Imagine that the client foo, with IPv4 address 10.3.3.3 in the figure below, wants to connect to the server bar (10.1.1.1) through SSH (TCP port 22).

Figure 6: Port knocking: the clients want to reach the server through SSH

The server bar can be directly reached on the Internet, and its administrator does not wish TCP port 22 to look open to people performing port scans. So, they have enabled port knocking: to open the port to a client (foo, for instance), the server (in a typical implementation) must receive from this client a specific sequence of “knocks”. In this example, server bar waits for UDP packets on ports 1111, 2222, 3333, then a TCP packet on port 4444, in this order. Once this sequence of packets has been received from a host, the server opens its TCP port 22 for this host, and the SSH connection becomes possible. The graphical representation of this operation, in the shape of a state machine, is shown in the figure below.


Figure 7: Port knocking state machine

That is how a typical port knocking implementation works. With the BEBA architecture, port knocking follows the same principle, except that the server does not have to hide its port itself. Instead, it is the switch that performs the secret sequence verification and allows (or discards) traffic addressed to the server. The whole security mechanism occurs at the switch level, thus saving bandwidth and latency.
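The per-client state machine of Figure 7 can be sketched as a counter plus a transition function. The following is an illustrative sketch with the ports of the example; the names and the exact reset policy are assumptions, not the eBPF code from D3.4:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum proto { PROTO_UDP, PROTO_TCP };

struct knock { enum proto proto; uint16_t dport; };

/* The secret sequence: UDP 1111, 2222, 3333, then TCP 4444. */
static const struct knock SEQ[] = {
    { PROTO_UDP, 1111 }, { PROTO_UDP, 2222 },
    { PROTO_UDP, 3333 }, { PROTO_TCP, 4444 },
};
#define SEQ_LEN (sizeof(SEQ) / sizeof(SEQ[0]))

/* state: number of correct knocks seen so far; SEQ_LEN means "port open".
 * Returns true if the packet may pass (SSH once the port is open);
 * the knock packets themselves are absorbed. */
static bool knock_step(uint8_t *state, enum proto proto, uint16_t dport)
{
    if (*state == SEQ_LEN)                     /* OPEN: allow SSH only   */
        return proto == PROTO_TCP && dport == 22;
    if (proto == SEQ[*state].proto && dport == SEQ[*state].dport)
        (*state)++;                            /* correct knock: advance */
    else
        *state = 0;                            /* wrong packet: reset    */
    return false;
}
```

In a switch-based deployment, one such state would be kept per client address, for instance in a stateful table keyed on the source IP.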

2.9.2. Functional assessment — eBPF version The code used for the proof-of-concept implementation of the eBPF version of the port knocking application is provided in deliverable D3.4. The application has been successfully tested, but for port knocking there are no particular metrics to measure.

2.10. DDoS Detection and Mitigation

The DDoS application was implemented for the OFSoftSwitch-based BEBA switch. In a controlled virtual environment, it was tested with synthetic traffic (generated well-known DDoS patterns). Since the end of 2016, DDoS mitigation has been deployed in a live traffic environment (mirrored traffic from the CESNET infrastructure). The live trial deployment has two goals: (1) assessing suitability for long-term deployment, and (2) validating the quality of detection. The long-term deployment showed that the application is stable and has run for more than two months, as seen in the following figure. All of the detected events and other telemetry are logged for further analysis. An improved version was deployed in the middle of January, hence the change in the graph shape.


The quality of detection was assessed by comparing the application’s output to the outputs of other monitoring systems that are already in place in the CESNET network. While the following two graphs display data of a slightly different nature (all flows on the left and new flows on the right), the correlation is clearly visible. In this case, the attack was a scan of 88 ports of all machines connected to two Austrian universities.

The following figure shows the memory requirements of the BEBA switch running in the live trial environment. The increased memory requirements correlate with the increased number of observed flows. For performance reasons connected with memory allocation and freeing,
the BEBA switch implements a memory pool that frees memory with some latency. This results in higher memory requirements; however, the clear benefit is the improved performance.

2.11. AXE

This use case, originally proposed in [1], is a new approach to the L2 spanning tree. The proposed mechanism can be summarized as follows:

1. As in the standard MAC learning procedure defined in IEEE 802.1D, each switch learns the binding between the source MAC address and the input port from which packets are received. Differently from the standard mechanism, the data packet piggybacks metadata that represents the sum of the weights of all the links traversed by the packet. In this way, the switch learns the total weight of the paths to all destinations.

2. When a packet is addressed to a MAC address that is not in the Forwarding DataBase (FDB), the switch is forced to flood the packet to all possible output ports. Thus, considering a redundant L2 topology, all switches may receive multiple copies of the same packet coming from different paths.

3. All switches are thus able to bind the best port to each source MAC address by simply choosing the one associated with the path with the least total weight. Nevertheless, in redundant topologies, this flooding mechanism results in inevitable packet duplications. To prevent this, all ingress switches add metadata to all packets received from a trunk port, representing a unique sequence number consisting of the switch identifier and a progressive locally unique number. In this way all egress switches are able to deliver only one copy of the original packet to its destination. Duplicate packets are used for the discovery mechanism and may update transport switches' FDBs.

This application requires keeping track of the following states: (i) a per-host state representing the binding between its MAC address and the associated output port; (ii) a per-packet state used to prevent the delivery of duplicate packets; (iii) a per-switch state representing the last sequence number.
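For illustration, the packet tag and the path-weight learning rule can be sketched as follows. Plain C structures stand in for the OPP tables; all names are assumptions made for this sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Metadata piggybacked on the packet: the unique sequence number (switch
 * identifier + locally unique counter) and the accumulated path weight. */
struct axe_tag {
    uint16_t switch_id;
    uint32_t seq;
    uint32_t path_weight;
};

/* Per-host FDB entry: best reverse port and its total path weight. */
struct fdb_entry {
    uint8_t  port;
    uint32_t weight;
    bool     valid;
};

/* Keep the port whose cumulative path weight is lowest (the comparison is
 * <=, as in condition C0 of the edge-switch configuration). */
static void axe_learn(struct fdb_entry *e, uint8_t in_port, uint32_t weight)
{
    if (!e->valid || weight <= e->weight) {
        e->port   = in_port;
        e->weight = weight;
        e->valid  = true;
    }
}
```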

2.11.1. Detailed configuration

Figure 8


This use case considers two types of switches. The transport switches have only switch-to-switch links and consist of three OPP stages, respectively responsible for: (1) adding the input link weight to each incoming packet; (2) keeping the best reverse path for each flow (identified by the source MAC address); (3) performing the final L2 forwarding. The edge switches execute all functions implemented by the transport switches (with a few minimal differences) plus the detection of duplicated packets. Due to the limited space available, only the detailed OPP table configuration of an edge switch is included in Figure 8. Figure 8 (second row) shows the detailed configuration of the four OPP stages of edge switch ES1. For each packet entering Stage 0, the input port is checked. For packets from port 0 (entry #1), an MPLS label consisting of the switch ID and a locally unique sequence number (stored in global variable G0 and incremented by 1 for every new packet) is pushed. Packets from a transport port (entry #2) are left unmodified. In both cases packets are forwarded to Stage 1.

Figure 9: AXE use case topology

Stage 1 is configured with a look-up and update key extractor identified by the MPLS label. The first copy of a packet will match rule #1, while duplicated packets will match rule #2. In the latter case, the packet metadata is marked. In all cases the packet is forwarded to Stage 2. Stage 2 has look-up and update keys set to the MAC source address and considers the following conditions. C0 verifies whether the path weight carried in the packet is less than or equal to the one in the flow context table. C1 checks whether the switch is the ingress or egress one for the current packet. For egress switches the MPLS header is removed, while it is kept for ingress switches. The Stage 2 EFSM consists of the following 5 entries:

• #1 is matched by a duplicate packet belonging to a flow with a better reverse path stored in the flow context table. In this case the idle timeout is updated and the packet is dropped.


• #2 is matched by a new packet belonging to a flow with a better reverse path stored in the flow context table. In this case the packet metadata is marked, the MPLS header is removed and the idle timeout is updated. This entry applies only to an egress switch.

• #3 is equivalent to the previous one but it applies only to ingress switches.

• #4 is matched by a packet belonging to a flow without a better reverse path stored in the flow context table. The MPLS header is removed and the idle timeout is updated.

• #5 is equivalent to the previous one but it applies only to ingress switches.

Stage 3 is basically a MAC learning stage that dynamically updates the binding between the MAC source address and the switch ports (entries #3, #5, #7 and #9). In addition, this stage drops any duplicated packet (entry #1). All remaining entries (#2, #4, #6 and #8) simply forward the packets without updating the associated output port.

2.12. HULA

This use case is an implementation of a distributed load balancing algorithm proposed in [2] and designed for multi-rooted topologies like “leaf-spine” and “fat-tree”, widely used in data centers. Every leaf node evaluates the best next hop for each destination in order to forward each incoming flow through the current best path. In this mechanism, each application flow is divided into bursts of packets separated by a significant time interval (i.e. an inter-packet gap greater than 1 RTT), the so-called flowlets. Dividing traffic into flowlets is a better approach for balancing network traffic load in the presence of heavy hitter flows. Leaf switches are aware of the global path utilization state by synchronising with each other through special signaling messages (namely probes). Probes are propagated through the core network in order to monitor link utilization. In our case, they are piggybacked on data traffic, and the probe generation frequency is determined by the number of flow packets, e.g. every N packets a probe is sent out. Probes are created by leaves and replicated by spines, in such a way that the information reaches the other leaves, where it is used to choose the best next hop for every destination. This application requires in-switch stateful forwarding support, as it needs to store and dynamically update the following states: (i) a per-port state representing the link utilization; (ii) a per-flowlet state that binds each traffic burst to a switch output port and is used to ensure the forwarding consistency of all packets belonging to the same flowlet.

2.12.1. Detailed configuration


Figure 10

Differently from the original work in [2], in our implementation a probe is a small header mapped on top of MPLS and consists of two fields: (1) leaf_id (3 bits mapped onto the tc field of the MPLS header), which contains a leaf identifier; (2) util (20 bits mapped onto the MPLS label field), which represents the sum of the weights of the path from the probe source. Figure 10 shows the L1 switch configuration. Table 0 dispatches the packets towards the other stages based on whether they are probes, packets from core ports or packets from server ports. Table 1 is an L2 forwarding DB. Table 2 accounts for probe generation and utilization estimates. These estimates are calculated by means of an Exponentially Weighted Moving Average (EWMA) applied to the sampled data rate, implemented as an update operation in OPP. Finally, Table 3 is the stage in which the forwarding algorithm takes place, based on the information gathered from the other nodes.

Stage 0. When an MPLS packet enters from a transport port (rules #1, #2), it is forwarded to Table 3 in order to extract the utilization information stored in the label field. Instead, if an IP packet is matched from ports 1 or 2, it is sent to Stage 2, where it is used for EWMA measurements or replicated and sent back to the other leaves as a probe. Condition C0 states whether a probe has to be triggered: a counter is incremented for each packet matching the entry, i.e. while C0 is false (rules #5, #6); when it reaches the maximum number of packets, i.e. C0 is true (rules #3, #4), the counter is reset to zero, an MPLS header is pushed and the packet is forwarded to Stage 2. Finally, a packet from a server port continues to Table 1 (rule #7).
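The EWMA update used for the utilization estimate in Table 2 can be sketched as a single register update. This is a fixed-point sketch with an assumed smoothing factor of 1/8; the actual implementation and its constants may differ:

```c
#include <assert.h>
#include <stdint.h>

/* One EWMA step: avg' = avg + alpha * (sample - avg), with alpha = 1/8.
 * This is the kind of per-packet update an OPP stage can apply to a
 * global data variable holding a link-utilization estimate. */
static uint32_t ewma_update(uint32_t avg, uint32_t sample)
{
    int64_t diff = (int64_t)sample - (int64_t)avg;
    return (uint32_t)((int64_t)avg + diff / 8);
}
```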


Figure 11: HULA use case topology

Stage 1. Stage 1 is stateless, since it is a simple L2 FDB.

Stage 2. This stage is responsible for probe generation and for the EWMA rate estimation of the core links in the downlink direction. When a packet is matched with metadata equal to 1 (rules #1 and #2), a probe has to be generated. For this purpose, the packet is duplicated by means of an OpenFlow Group action. The first copy is sent back through the input port, with the estimate of the downlink utilization of that port written into mpls_label. The same is done for the second copy, but in this case the information is relative to the other core port through which it would be sent. Those estimates are stored in two Global Data Variables, G1 and G2, updated by the EWMA measurement at each packet matching rules #3 and #4. Finally, each matching packet is sent through port 3 in order to reach the hosts.

Stage 3. At first (rules #1 to #4), every probe sent here by Stage 0 has its utilization field value saved in Global Data Variables as follows: (i) G1 and G3 store the port 1 and port 2 hop costs for destination leaf 2 (in the case of Leaf 1); (ii) G2 and G3 store the port 1 and port 2 hop costs for the second destination leaf, i.e. leaf 3; probe packets are then dropped. Rules #5 to #8 make use of two conditions, C1 and C2, that evaluate which port has the lower utilization for destination leaf 2 or 3, respectively. A packet matching those rules is thus forwarded through the best output port at that moment, and its flowlet is then associated with a state (1 or 2, depending on the output). The division into flowlets is simple: the set_state action also sets the idle timeout for the flow, which should be greater than or equal to the RTT of the whole network. Finally (rules #9 to #12), a flowlet associated with a state continues flowing through its output port until the timeout expires.

The spine implementation is simpler than the one presented above. It only contains two stages: stage 0 is a MAC address table like Table 1 in the leaf implementation; stage 1 is similar to Table 2, but in the spines' case, probes received from leaves are updated with the addition of the spine's own EWMA estimate of the link, and then replicated to the other leaves. The detailed spine switch implementation is omitted from Figure 10.
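The flowlet binding performed in the leaf forwarding stage can be sketched as follows. This is an illustrative per-flow record; the names and explicit timeout handling are assumptions, whereas the actual OPP implementation relies on the state idle timeout:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Per-flowlet state: when the flow was last seen and the bound port. */
struct flowlet {
    uint64_t last_seen_ns;
    uint8_t  out_port;
    bool     valid;
};

/* Reuse the stored port while the inter-packet gap stays within the idle
 * timeout (chosen >= network RTT); otherwise a new flowlet starts and is
 * bound to the currently best port. */
static uint8_t flowlet_port(struct flowlet *f, uint64_t now_ns,
                            uint64_t idle_to_ns, uint8_t best_port)
{
    if (!f->valid || now_ns - f->last_seen_ns > idle_to_ns) {
        f->out_port = best_port;  /* new flowlet: rebind to the best path */
        f->valid    = true;
    }
    f->last_seen_ns = now_ns;
    return f->out_port;
}
```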


3. Performance assessment

In this section we analyse the performance characteristics of different BEBA switch implementations.

3.1. OFSoftSwitch

This section reports on the performance of the BEBA node implemented on top of the OFSoftSwitch (OFSS) software platform. Full details about the implementation and the software acceleration techniques undertaken are extensively reported in deliverable D3.4. In the following, we focus on the experimental setup for performance evaluation and on the actual switching speed achieved under different traffic scenarios and processing configurations.

3.1.1. Experimental setup The next Figure shows the experimental testbed for the performance assessment campaign.

The BEBA node built on top of OFSS and the traffic generator run on two identical PCs with 8-core Intel Xeon E5-1660 v3 CPUs (3.0 GHz), each equipped with a pair of identical Intel 82599 10G NICs. Both systems run a Linux Debian stable distribution (kernel v. 3.16). A third server runs the Ryu controller and is connected to the switch's server through a 1G control network interface. The tests mainly assess the forwarding speed that the BEBA node reaches when running applications with increasing processing burden. The first set of tests deals with pure OpenFlow switching capability and mainly measures the performance acceleration of OFSS when used as a pure switch. The next sets of tests instead assess the forwarding capability of actual BEBA applications, running both stateless and stateful operations.

3.1.2. OpenFlow performance The first set of experiments consists of pure speed tests to benchmark the performance of the accelerated version of OFSS (aOFSS) when running a standard OpenFlow pipeline. Figure 12
shows the achieved throughput when varying the number of processing cores and the input packet sizes. Both the original OFSS performance and the line rate limits are also reported for comparison. The first immediate result is the dramatic increase in forwarding rate achieved by the accelerated BEBA node with respect to the plain OFSS. The figure shows that line rate forwarding speed is nearly attained in all cases and that performance scales quite well with the number of processing instances of aOFSS, even in the most critical case of shortest-packet-size traffic. The increase in performance is more clearly depicted in Figure 13, which shows an excellent 90x acceleration factor in the worst case of minimum-sized packets with 4 processing cores.

Figure 12: OpenFlow pipeline throughput

[Plot: Throughput (Mpps, 0 to 16) vs. Packet Size (Bytes), from 64 to 1500 bytes; curves: Line Rate, OFSS, and aOFSS with 1 to 4 cores]


Figure 13: Speedup factor

3.1.3. BEBA stateless processing The second set of experiments aims at assessing the aOFSS performance when running OPP, including multiple pipeline configurations. Figure 14 shows the throughput achieved by aOFSS when using stateless BEBA stages. Obviously, the processing stages have an impact on the overall forwarding rate of the node. However, the system still hits line rate for packets of realistic sizes bigger than 128B. In the most critical case of the shortest packet size, performance decreases linearly with the number of stages, but still reaches well above 10 Mpps with 4 running cores. Notice that the performance for one stage is comparable to that of OpenFlow; in fact, a stateless OPP stage is functionally equivalent to an OpenFlow table.

[Plot: Speedup Factor (0 to 100) vs. Packet Size (Bytes), from 64 to 1500 bytes, for 1 to 4 cores]


Figure 14: Stateless BEBA stages throughput

3.1.4. BEBA stateful processing Figure 15 shows the performance for a pipeline of stateful BEBA stages. As expected, performance decreases, with line rate achieved for packet sizes bigger than 256B. The degradation is more significant as the number of stages increases. However, we remark that our test measured a somewhat worst-case behavior, in which every packet performs a state update. In real use cases, most packets perform just a state lookup, with only a few of them actually triggering a state modification.

[Bar chart: Throughput (Mpps) for 1 to 4 stages and 1 to 4 cores, grouped by packet size from 64 to 1500 bytes; Line Rate shown for reference]


Figure 15: Stateful BEBA stages throughput

3.2. eBPF

3.2.1. BEBA with eBPF The eBPF Linux subsystem, used in the kernel for filtering packets as well as for tracing functions and events, has been identified as a possible target for the BEBA abstract interface. In deliverable D3.4, we explained how 6WIND:

• Validated the feasibility of implementing the BEBA interface as eBPF programs.

• Experimented with, and developed solutions to run eBPF programs in user space, in particular on top of its 6WINDGate™ software.

This section presents the performance tests for this implementation. As of this writing, the tests are still being performed. We have obtained a first set of results, but the code is still receiving modifications and optimizations, and additional performance assessments are to be carried out.

3.2.2. Parameters to test In order to assess that we could implement the BEBA interface as eBPF programs, two example use cases were implemented: port knocking and token bucket. While port knocking is a didactic introductory example of stateful processing, it does not really represent an industrial use case of an application that should sustain heavy loads and would need an accelerated datapath. Instead, the per-host token bucket use case (see Section 2) is a better example of a real-life need: it can be set up on a host in order to limit the number of packets from a host that will be
accepted (forwarded) during a period of a given length. As a consequence, this use case is used in this section for tests involving stateful processing of packets with eBPF. We want to test several parameters. Throughput, obviously, is an essential piece of data: at what rate is the switch capable of forwarding? But besides evaluating the transmission rate under normal forwarding circumstances, we need, of course, to evaluate the impact of stateful processing on the throughput. The tests have to be designed in such a way that we can evaluate this overhead in terms of performance degradation. Note that the work on eBPF focuses on the data plane, and no controller is involved in the tests. The eBPF programs are simply installed and attached to the relevant interfaces through the command line.

3.2.3. Topology We have started to conduct internal tests at 6WIND, to evaluate the performance we can obtain with eBPF-enabled software. We use a very simple topology, in which a traffic generator has two ports connected to two interfaces of a blade. One port is used to send unidirectional traffic to the server, which acts as a switch (or a router) and forwards L2 (respectively L3) traffic back to the generator on its second port. The links support up to 10 Gbps. The server runs an Intel® Xeon® Processor E5-2699 v3 (45 M cache, 2.30 GHz, 18 cores) and Ubuntu 16.04 LTS. Below is a diagram of this topology:

      +---------------------+
    A |       10 Gbps       v B
+-----------+         +-----------+
| Ixia XM2  |         |  Server   |
+-----------+         +-----------+
    D ^       10 Gbps       | C
      +---------------------+

The tests being conducted include:

Back-to-back performance measurement (no switch). A simple test can be run to assess that the baseline of the traffic generator reaches wire speed. The topology is temporarily simplified, as in the diagram below:

      +---------------------+
    A |      10 Gbps        |
+-----------+               |
| Ixia XM2  |               |
+-----------+               |
    D ^                     |
      +---------------------+

Simple L3 forwarding. Testing the L3 capabilities of both Linux and the 6WINDGate™ on the topology provides a
reference point for later tests. It can be performed with a single flow (processed by a single CPU) as well as with multiple ingress flows (which should trigger the use of multiple CPUs through RSS).

eBPF packet drop / simple forward. Introducing eBPF, be it on Linux with tc or in the fast path, may alter the statistics. As a consequence, we want to run tests with no filtering, either dropping or forwarding all packets, but involving the setup and execution of a minimal eBPF program (one that only returns a value and exits). On Linux, the eBPF program is attached to the tc interface hook and injected with the tc utility. On 6WIND's fast path, the management command tool is used to inject the program and to initialize map contents.

Token bucket. At last, this is the interesting part! We want to run the token bucket and to compare bit rate and latency with simple forwarding. The token bucket can be run in several modes:

• With a capacity of at least 200,000 tokens and a regeneration rate of 200,000 tokens per second, for example, it lets us see whether the state transitions can sustain this rate, and whether the application runs as expected. The results of this test were presented in the functional assessment section.

• With a modified version of the token bucket that performs table lookups and updates normally, except that it always returns the same value (always allowing packet forwarding), we can evaluate how many total packets can be handled by the mechanism over a certain period of time.

3.2.4. Results

Here are the results we obtained so far. We ran each test for packets of 64, 128, 256 and 512 bytes, and with 1, 2, 3 and 4 cores processing the traffic. Each time, the number of flows (and hence of buckets) is equal to the number of cores processing the traffic. And of course, we ran those tests both on Linux and with 6WIND software.

Back-to-back performance measurement (no switch)
The tests showed that the Ixia generator is able to sustain line rate (10 Gbps throughput) for all packet lengths.

Simple L3 forwarding
This is the first set of tests, without any eBPF program involved.


Figure 16: eBPF performance assessment: no eBPF program

As expected, the 6WINDGate™ performs much better than Linux. Its rate is close to wire speed in all cases, save for 64-byte packets on a single core. Meanwhile, Linux only reaches line rate with four cores and packets of at least 256 bytes.

eBPF packet drop / simple forward
Now we add a minimal eBPF program that simply returns a constant value (indicating to tc that the packet should be forwarded, so that we can measure the throughput). The results are nearly identical to the previous test:


Figure 17: eBPF performance assessment: minimal eBPF program
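The deliverable does not list the minimal program itself; a tc classifier of the kind described can be sketched as follows. This is a hedged reconstruction, not the actual BEBA code: `minimal_fwd` is a hypothetical name, and the kernel definitions (`TC_ACT_OK`, `struct __sk_buff`) are stubbed so the sketch also compiles as plain C.

```c
/* Sketch of a minimal tc eBPF classifier (assumption: the actual BEBA
 * test program is not shown in the deliverable). A real build targets
 * BPF, e.g.  clang -O2 -target bpf -c minimal_fwd.c -o minimal_fwd.o,
 * and takes these definitions from <linux/pkt_cls.h> and <linux/bpf.h>;
 * they are stubbed here for illustration. */
#define TC_ACT_OK 0                 /* tc verdict: accept/forward       */
struct __sk_buff;                   /* kernel-provided packet context   */

__attribute__((section("classifier")))
int minimal_fwd(struct __sk_buff *skb)
{
    (void)skb;                      /* no parsing, no map lookups       */
    return TC_ACT_OK;               /* constant verdict: forward all    */
}
```

With iproute2, such an object is typically attached via the clsact qdisc, e.g. `tc qdisc add dev <if> clsact` followed by `tc filter add dev <if> ingress bpf da obj minimal_fwd.o sec classifier` (interface name and object file are placeholders).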

Both the 6WINDGate™ and Linux suffer a slight decrease in performance, but not much: in both cases, the measured rate is close to the throughput obtained without any tc filtering.

Token bucket
At last we add the token bucket. If we ask the token bucket to actually drop packets, it is hard to evaluate the throughput; it is a good way to check that the application works, though, and this was presented in Section 2 for functional assessment. Here instead, we use a “transparent” token bucket that processes the packets normally, but still forwards all of them in the end.

Figure 18: eBPF performance assessment: “transparent” token bucket


This time we observe a significant drop in performance, caused by the stateful processing of the packets. As with the other implementations of BEBA, this is mainly due to two factors:

• Of course, there is the fact that we add actual advanced processing of the packets, which necessarily makes the path of a packet longer with regard to the previous tests in this section.

• Also, we rely on eBPF hash maps and perform three hashes for each packet, which is expensive in CPU cycles.

Regardless of these factors, we can also observe that the 6WINDGate™'s performance generally decreases as the number of cores rises. The reason is simple: to correctly update a value in the state table, we currently use a spinlock, and it currently encompasses the whole update function, including the computation of the hash. As a result, all threads spend most of their time waiting for this lock to be free, creating a major bottleneck. By contrast, with one core there is, of course, no such problem. The 6WINDGate™ keeps a high rate, with more than 50% of its initial (no eBPF) throughput, at around 2.40 Gbps, whereas Linux, with the same eBPF application, drops to 0.40 Gbps.

3.2.5. Improvements and future tests

Support for eBPF on the 6WINDGate™ is still being improved and optimized. The spinlock, in particular, is at the top of the list of items to change in order to bring decent performance to multicore execution of the application. We already fixed an important memory leak, and removed a bottleneck in the computation of the current time (in the helper used to obtain the arrival time of packets for the token bucket). Some thought is also being given to the hash algorithm, which we believe could be made faster. This, of course, calls for new tests in the future. In particular, we intend to run the same kind of tests on CESNET’s testbed. Not the back-to-back experiment, obviously, since we will not change the setup of the virtual lab; otherwise, a simple topology featuring a host and a switch as the device under test will allow us to repeat the same experiments.


Figure 19: CESNET’s virtual lab topology for testing TCP SYN flood protection

We intend to run this setup in shared mode, with traffic produced by the traffic generator, and to ask CESNET for assistance in running it with live traffic.

3.3. mSwitch

To evaluate the performance of mSwitch-OPP we implement a simple use case consisting of a classifier that distinguishes short-lived flows from long-lived flows by considering as long any flow that has transmitted at least N packets. We also add a simple clean-up extension that resets flow state after it has been idle for 10 seconds. Note that this description was already included in D3.4 and is repeated here for convenience.

In terms of setup, we use an mSwitch instance running on an x86 server with two 10GbE NICs (Intel x540) and 255 flows installed. Packet size is always 60 bytes excluding the Ethernet CRC. We use Linux 4.9.0 with an Intel(R) Xeon(R) CPU E3-1231 v3 @ 3.40GHz (4 cores). For reference, to insert the flow rules into the OPP tables we use the oppctl command with the following parameters:

for i in `seq 1 255`; do ./deployed/opp/oppctl 1 10.0.0.2 50000 10.0.0.$i 60000 udp; done

With all of this in place we measure throughput for an increasing number of CPU cores, up to the 4 available on the server. In each case we generate the same number of flows as cores, although it is worth pointing out that we also ran an experiment with 255 flows and the results did not show any substantial differences. Further, to see whether OPP is CPU intensive, we perform the tests using the CPU's maximum frequency (3.4GHz) as well as a down-clocked frequency of 1GHz.
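The per-flow classifier logic described above can be sketched as a small userspace routine (a simplification, not the actual OPP state-machine implementation; the threshold value is an assumption, since the text leaves "N" unspecified).

```c
/* Userspace sketch of the long-lived flow classifier used in the
 * mSwitch-OPP test (the actual OPP implementation is not shown in this
 * deliverable). A flow becomes "long-lived" once it has transmitted at
 * least N packets; its state is reset after 10 seconds of idleness. */
#include <stdint.h>

#define N_PKTS  10                  /* threshold: assumed value for "N" */
#define IDLE_NS 10000000000ULL      /* 10 s idle timeout                */

struct flow_state {
    uint64_t pkts;                  /* packets seen on this flow        */
    uint64_t last_ns;               /* arrival time of the last packet  */
};

/* Process one packet; returns 1 if the flow is classified long-lived. */
int flow_update(struct flow_state *f, uint64_t now_ns)
{
    if (now_ns - f->last_ns > IDLE_NS)
        f->pkts = 0;                /* clean-up: idle too long, reset   */
    f->last_ns = now_ns;
    f->pkts++;
    return f->pkts >= N_PKTS;
}
```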


Figure 20: Performance of mSwitch-OPP running the long-lived flow use case for multiple CPU cores and different CPU frequencies. Packets are minimum-sized.

The results, shown in Figure 20, indicate that mSwitch-OPP provides high throughput (up to a maximum of 13.2 Mp/s, close to the 10Gb line rate of 14.8 Mp/s) and scalability with respect to an increasing number of CPU cores; this scalability is likely to improve with the introduction of cache isolation technologies such as Intel's CAT. Further, as can be seen from the difference in rates between the two curves, OPP processing is CPU intensive, with throughput improving significantly at higher CPU frequencies. Finally, we note that the number of flows in the actual traffic appears not to matter in these experiments: in the 1GHz case, increasing the number of flows to 255, so that all the installed flow entries are evenly looked up, resulted in a negligible difference in throughput with respect to the four-flow case. In short, mSwitch-OPP shows the feasibility of running the OPP abstraction at high speed in software on a standard, off-the-shelf x86 server. This represents a good complement to OPP's hardware implementation.



4. Conclusions

This deliverable shows that the BEBA abstraction can be used to implement a number of real-world use cases. We tested the abstraction both in emulated environments and in hardware proof-of-concept environments. We also performed tests using real-world traffic traces and compared the results of BEBA's implementation with those provided by legacy systems. We conclude that the BEBA abstraction is solid and useful as a replacement for legacy implementations of several network applications.

Our performance assessment further shows that different approaches can be used to implement the BEBA abstraction in software. In particular, the assessment shows that the implementation can scale to a performance level of more than 8 Mpps when using a highly optimized software switch such as mSwitch. Furthermore, it shows that good performance of several Mpps can also be achieved with fast prototyping technologies such as OfSoftSwitch and with highly portable Linux implementations such as eBPF.


5. Bibliography

[1] McCauley, James, et al. “Taking an AXE to L2 Spanning Trees.” Proceedings of the 14th ACM Workshop on Hot Topics in Networks. ACM, 2015.
[2] Katta, Naga, et al. “HULA: Scalable Load Balancing Using Programmable Data Planes.” Proceedings of the Symposium on SDN Research. ACM, 2016.
[3] OfSoftSwitch repository, https://github.com/CPqD/ofsoftswitch13
[4] Linux man page, eBPF, http://man7.org/linux/man-pages/man2/bpf.2.html
[5] Michio Honda, Felipe Huici, Giuseppe Lettieri, and Luigi Rizzo. “mSwitch: A Highly-Scalable, Modular Software Switch.” Proceedings of the 1st ACM SIGCOMM Symposium on Software Defined Networking Research (SOSR ’15). ACM, New York, NY, USA, 2015, Article 1, 13 pages. DOI: http://dx.doi.org/10.1145/2774993.2775065
[6] Roberto Bifulco, Julien Boite, Mathieu Bouet, and Fabian Schneider. “Improving SDN with InSPired Switches.” Proceedings of the Symposium on SDN Research (SOSR ’16). ACM, New York, NY, USA, 2016, Article 11, 12 pages. DOI: https://doi.org/10.1145/2890955.2890962