D5.3 – Engineering Release 2 (Secure NFV Subsystem, High Availability Framework, Degradation Detection and Correction, Autonomic Rules Generator and I/F, Network Resilience Framework)


Document Number: D5.3
Status: Completed
Work Package: WP5
Deliverable Type: Report
Date of Delivery: 17.07.2017
Responsible Unit: Fraunhofer
Editors: Marius Corici (Fraunhofer)

Contributors (alphabetic order)

Haytham Assem (IBM)

Jaafar Bendriss (Orange)

Teodora Sandra Buda (IBM)

Bora Caglayan (IBM)

Eleonora Cau (Fraunhofer)

Marius Corici (Fraunhofer)

Fabrizio Granelli (UNITN)

Iryna Haponchyk (UNITN)

Alessandro Moschitti (UNITN)

Andrea Passerini (UNITN)

Antonio Pastor (TID)

Daniel-Ilie Gheorghe Pop (Fraunhofer)

Benjamin Reichel (TUB)

Ranjan Shrestha (TUB)

Mikhail Smirnov (Fraunhofer)

Zsolt Szabo (Fraunhofer)

Martin Tolan (WIT)

Kateryna Tymoshenko (UNITN)

Imen Grida Ben Yahya (Orange)

Reviewers (alphabetic order)

Alberto Mozo (UPM)

Martin Tolan (WIT)

Dissemination level: PU


Change History

Version | Date | Status | Editor (Unit) | Description
0.1 | 15.05.2017 | Draft | Marius Corici (Fraunhofer) | Provided initial template and deliverable structure based on the intended contributions
0.2 | 03.06.2017 | Draft | Ranjan Shrestha (TUB) | Integrated the initial content of the partners
0.3 | 21.06.2017 | Draft | Ranjan Shrestha (TUB) | Integrated the content from FOKUS, TUB, IBM, Orange and UNITN
0.4 | 27.06.2017 | Draft | Ranjan Shrestha (TUB) | Updated security section and fixed some figure references
0.5 | 04.07.2017 | Draft | Ranjan Shrestha (TUB) | Completed the performance degradation section of FOKUS; added executive summary, introduction and conclusions
0.6 | 05.07.2017 | Draft | Ranjan Shrestha (TUB) | Editorial modifications
0.7 | 09.07.2017 | Draft | Ranjan Shrestha (TUB) | Completed the missing sections
0.8 | 10.07.2017 | Draft | Ranjan Shrestha (TUB) | Added the section from TSSG
0.9 | 11.07.2017 | Draft | Marius Corici (Fraunhofer) | Editorial repairs; re-worked introduction and conclusions
0.10 | 12.07.2017 | Draft | Ranjan Shrestha (TUB) | Other editorial changes; modified the Experience section to include the insight from the testbeds
0.11 | 12.07.2017 | Draft | Marius Corici (Fraunhofer) | Cleaned-up version for internal review
0.12 | 12.07.2017 | Draft | Martin Tolan (WIT) | Comments and proof reading
0.13 | 13.07.2017 | Draft | Alberto Mozo (UPM) | Comments and review
0.14 | 13.07.2017 | Draft | Ranjan Shrestha (TUB) | Integration of comments
0.15 | 14.07.2017 | Draft | Ranjan Shrestha (TUB) | Integration of the responses to the comments
1.0 | 17.07.2017 | Final | Marius Corici (Fraunhofer) | Cleaned-up version


Executive Summary

This deliverable is the second report on the implementation work within WP5 of CogNet. It reports on three types of activities, developed as end-to-end proofs of concept and verified on testbeds:

• The acquisition of an appropriate amount of data for training and evaluating the machine learning algorithms (an update to the first engineering report, D5.2).
• The development of machine learning mechanisms which use the acquired data to give insight into specific management features (an update to D5.2, using the algorithms from D3.2 and D3.3 that were developed specifically for WP5).
• The development of proof-of-concept actuation mechanisms demonstrating the effectiveness of the machine learning techniques as added value to the current network management mechanisms.

A further evaluation of the implemented mechanisms, as well as an update on possible fixes to the testbeds, will be included in D5.4, which concludes the work of WP5.

Continuing the evolution of the work package towards the different types of security and network reliability, the work is split into the same three major directions as in deliverable D5.2: subscriber communication security, network security and network resilience. The deliverable follows the evolution of the testbeds described in D5.2 (the three security testbeds and the two resilience testbeds described previously), including the mitigation actions resulting from the recommendations of the machine learning.

To set up the plan for a proper evaluation of the different mechanisms, a new section on the concept of "Experience" is included (Section 4), describing an evaluation mechanism suited to the specifics of machine learning for network management in the area of security and network resilience.


Table of Contents

1. Introduction..................................................................................................................... 11

1.1. Motivation, Objective and Scope ........................................................................................................... 11

2. Distributed Security Enablement Testbeds .................................................................. 13

2.1. Distributed Security Enablement Testbed .......................................................................................... 13

Scope ....................................................................................................................................................... 13

Short description (update of D5.2) .............................................................................................. 13

Actual items implemented in CogNet ........................................................................................ 14

Testbed Implementation ................................................................................................................. 15

Initial Experimentation results ....................................................................................................... 19

Experience ............................................................................................................................................. 19

Conclusions and possible next steps .......................................................................................... 20

User Manual ......................................................................................................................................... 21

2.2. Honeynet Testbed ....................................................................................................................................... 28

Scope ....................................................................................................................................................... 28

Short description (update of D5.2) .............................................................................................. 28

Actual items implemented in CogNet ........................................................................................ 28

Testbed Implementation ................................................................................................................. 30

Initial Experimentation results ....................................................................................................... 32

Experience ............................................................................................................................................. 33

Conclusions and possible next steps .......................................................................................... 33

User Manual ......................................................................................................................................... 34

2.3. NFV Security Anomaly Detection Testbed ......................................................................................... 35

Scope ....................................................................................................................................................... 35

Testbed Description (update of D5.2) ........................................................................................ 36

Actual items implemented in CogNet ........................................................................................ 41

Initial Experimentation results ....................................................................................................... 42

Experience ............................................................................................................................................. 42

Conclusions and possible next steps .......................................................................................... 43

User Manual ......................................................................................................................................... 43

2.4. LSP-CLUSTER: Network Traffic Clustering towards Anomaly Detection ................................. 43


Scope ....................................................................................................................................................... 43

Short description (update of D5.2) .............................................................................................. 44

Testbed Implementation ................................................................................................................. 44

Initial Experimentation results ....................................................................................................... 46

Experience ............................................................................................................................................. 46

Conclusions and possible next steps .......................................................................................... 46

User Manual ......................................................................................................................................... 47

3. Network Resilience Testbeds ......................................................................................... 49

3.1. Performance degradation (Dense Urban Area Testbed) .............................................................. 49

Scope ....................................................................................................................................................... 49

Testbed Implementation ................................................................................................................. 49

Initial Experimentation results ....................................................................................................... 52

Actual items implemented in CogNet ........................................................................................ 55

Experience ............................................................................................................................................. 55

Conclusions and possible next steps .......................................................................................... 56

User Manual ......................................................................................................................................... 56

3.2. SLA ..................................................................................................................................................................... 60

Scope ....................................................................................................................................................... 60

Short description (update of D5.2) .............................................................................................. 60

Actual items implemented in CogNet ........................................................................................ 60

Testbed Implementation ................................................................................................................. 61

Initial Experimentation results ....................................................................................................... 70

Experience ............................................................................................................................................. 72

Conclusions and possible next steps .......................................................................................... 73

User Manual ......................................................................................................................................... 73

4. Types of Experience ........................................................................................................ 79

4.1. Introduction .................................................................................................................................................... 79

4.2. Per use-case types of experience ............................................................................................................. 80

Monitoring ............................................................................................................................................ 83

Detection ............................................................................................................................................... 84

Actuation ............................................................................................................................................... 85


4.3. Executive summary and Conclusions ..................................................................................................... 86

5. Conclusions and Further Work ...................................................................................... 87

Glossary, Acronyms and Definitions .................................................................................... 88

References ............................................................................................................................... 90


List of Tables

Table 1-1 Short overview of the deliverable 5.3 .................................................................................................. 12

Table 2-1 Rank for different sources of experience ........................................................................................... 20

Table 2-2 Set of netflow attributes ........................................................................................................................... 31

Table 2-3 Anomalies output detected by the isolation forest identifying source address of offending client. .................................................................................................................................................................................... 33

Table 2-4 Rank for different sources of experience ........................................................................................... 33

Table 2-5 Rank for different sources of experience ........................................................................................... 43

Table 3-1 Rank for different sources of experience ........................................................................................... 55

Table 3-2 Data summary .............................................................................................................................................. 62

Table 3-3 Subset of Monasca monitored metrics .............................................................................................. 62

Table 3-4 Components of Ponder 2......................................................................................................................... 68

Table 3-5 Comparison of results yielded from different LSTM architectures .......................................... 72

Table 3-6 MCDT Evaluation of the Test Set .......................................................................................................... 72

Table 3-7 Rank for different sources of experience ........................................................................................... 73

Table 4-1 Experience template .................................................................................................................................. 80

Table 4-2 Use case based interpretation of different source of experience on Monitoring .............. 81

Table 4-3 Use case based interpretation of different source of experience on Detection ................. 82

Table 4-4 Use case based interpretation of different source of experience on Actuation ................. 82


List of Figures

Figure 1-1 – List of testbeds and components described in this deliverable .......................................... 11

Figure 2-1 Distributed Security Enablement in Service Provider Infrastructure ..................................... 13

Figure 2-2 Distributed Security Enablement Testbed ....................................................................................... 15

Figure 2-3 Traffic generator with attack overlay ................................................................................................. 16

Figure 2-4 Highlighted first attack ............................................................................................................................ 17

Figure 2-5 SFlow Packet Contents ............................................................................................................................ 17

Figure 2-6 Service Function Chains in the edge security zone ...................................................................... 19

Figure 2-7 Distributed Security Enablement specific components .............................................................. 21

Figure 2-8 Fuel Installer Screen.................................................................................................................................. 24

Figure 2-9 Hardware selection for OPNFV Environment Installation .......................................................... 25

Figure 2-10 Install of the OPNFV for the DSE Testbed ..................................................................................... 26

Figure 2-11 OPNFV for the DSE Testbed successfully installed .................................................................... 26

Figure 2-12 Honeynet integration in Mouseworld............................................................................................. 30

Figure 2-13 Traffic patterns examples ..................................................................................................................... 31

Figure 2-14 - Adding Security with ML in the Cloud Infrastructure ............................................................ 36

Figure 2-15 NFV security testbed ............................................................................................................................. 37

Figure 2-16 Example of anomaly detection .......................................................................................................... 39

Figure 2-17 Graph of pairwise links between transmission nodes .............................................................. 46

Figure 2-18 Spanning tree ........................................................................................................................................... 46

Figure 3-1 Testbed showing MME-LB managing multiple MMEs for handling performance degradation ....................................................................................................................................................................... 50

Figure 3-2 Actual, predicted and squared error on abnormal data shown for mme_system_cpu_util[user] with anomalies represented by dashed vertical line. ................................ 53

Figure 3-3 Actual, predicted and squared error on abnormal data shown for mme_system_cpu_util[user]. ........................................................................................................................................ 54

Figure 3-4 Anomalies detected by the model on abnormal dataset with sudden spikes for mme_system_cpu_util[user]. ........................................................................................................................................ 54

Figure 3-5 Normal behaviour for metric mme_system_cpu_util[user] which represents percentage of CPU utilization. ............................................................................................................................................................ 57

Figure 3-6 Abnormal data with anomalies injected at the dashed lines at different time as indicated. ................................................................................................................................................................................................ 58


Figure 3-7 Abnormal data with sudden spikes of utilization with missing labels from dataset (the time of the anomalies is unknown). ......................................................................................................................... 58

Figure 3-8 Run of DTADE with Sample Data ........................................................................................................ 59

Figure 3-9 SLA Architecture ........................................................................................................................................ 61

Figure 3-10 Example of time series decomposition .......................................................................................... 63

Figure 3-11 Re-scale to make time series stationary: mean to 0 and standard deviation to 1 ........ 64

Figure 3-12 (a) - Intra VNFC correlations. (b) Inter VNFC correlations. (c) - Box plot of the Bono VNFC system-level metrics rescaled between 0 and 10. The blue box marks the quartiles of the distribution; the red line is the median and the crosses show the outliers. (d) Dividing Time series by first value to track the evolution. ........................................................................................................................ 65

Figure 3-13 Ponder2 framework ............................................................................................................................... 68

Figure 3-14 Relationship between RMI and XML Blaster ................................................................................ 69

Figure 3-15 The Incoming signal, Training data, Test set are indicated by Blue, Green and Red colors respectively ........................................................................................................................................................................ 71

Figure 3-16 Clearwater login dashboard ............................................................................................................... 74

Figure 3-17 Clearwater account configuration .................................................................................................... 74

Figure 3-18 Creating subscribers .............................................................................................................................. 74

Figure 3-19 Configuration Android SIP client ...................................................................................................... 75

Figure 4-1 Comparison based on impact of Time, Task and Context as sources of ML experience during Monitoring .......................................................................................................................................................... 83

Figure 4-2 Comparison based on impact of Time, Task and Context as sources of ML experience during Detection .............................................................................................................................................................. 84

Figure 4-3 Comparison based on impact of Time, Task and Context as sources of ML experience during Actuation .............................................................................................................................................................. 85


1. Introduction

1.1. Motivation, Objective and Scope

This deliverable describes a set of testbeds that showcase the advantages of machine learning for network management in the directions of subscriber communication security, network security, and network reliability and performance. The testbeds and components described in this deliverable are listed in Figure 1-1.

Figure 1-1 – List of testbeds and components described in this deliverable

The end-to-end processing is split into three parts: monitoring, detection and actuation. For some of the added-value management features included in this deliverable, the same machine learning or the same actuation mechanism is reused; this is marked throughout the deliverable where appropriate.

The deliverable includes the following sections:

Section | Description
2.1 Distributed Security Enablement | Updates on the monitoring, detection and mitigation actions of the testbed. This is an update of Section 2.1 of D5.2.
2.2 Honeynet | Updates on monitoring and detection. This is an update of the equivalent Section 2.2 of D5.2.
2.3 NFV Security Anomaly Detection | Updates on monitoring and mitigation actions, updating the equivalent Section 2.3 of D5.2. Section 2.4 updates the equivalent Section 2.4 of D5.2 on network traffic clustering towards anomaly detection.
3.1 Resilience testbed | An update towards an end-to-end PoC of the Section 3 testbed from D5.2.
3.2 Media SLA | A new section on the media SLA compliance problem and a proactive solution using machine learning.
4 Evaluation framework / Experience | Continues the development of the evaluation framework presented in Section 4.3 of D5.2, addressing the specifics of the use cases.

Table 1-1 Short overview of the deliverable 5.3

These testbeds are complete with regard to the functionality implemented. Additional small features and modifications may appear in the evaluation phase due to slight changes in the requirements. These modifications, together with the evaluation results and their assessment, will be reported in D5.4.


2. Distributed Security Enablement Testbeds

2.1. Distributed Security Enablement Testbed

Scope

The Distributed Security Enablement testbed was first proposed in D5.1 and subsequently documented with a proposed architecture in D5.2. The primary objective of the distributed security enablement testbed is the creation, deployment and maintenance of security-based Service Function Chains as a means to separate the processing of subscriber requests into different security zones, each reflecting a different level of trust in the subscriber's usage. In a typical scenario, these service function chains would be deployed at the edge of a service provider's network and will have the effect of enhancing the integrity of the network by providing distributed security zones that can act on security-related events and enforce dynamically generated policies. Figure 2-1 depicts the relationship of the distributed security enablement component within the context of the service provider's network and how it is positioned for operation at the edge of that network.

This section will provide an update on the current status of the distributed security enablement testbed, the work that has been carried out, current work in progress and finally the work that is planned for the final stage of the testbed.

Short description (update of D5.2)

D5.2 described the initial architecture of the distributed security enablement testbed, including how metric and monitoring data is injected into the testbed (monitor), processed within the CogNet Smart Engine (analyse) and finally forwarded to the Policy Management module for the actuation of corrective measures suited to the detected attack type (plan and execute).

Figure 2-1 Distributed Security Enablement in Service Provider Infrastructure


In this phase of the distributed security enablement testbed development, the main emphasis has been on the creation of a reliable tenant network, consistent metric data generation via probes, and the collection and feature extraction of that metric data as required for each attack type under investigation. This allows multiple datasets to be generated in which different attack types are captured. The resultant datasets are then used as input to the machine learning module, allowing different algorithms to be evaluated. The performance of each algorithm is then monitored, allowing the most appropriate model to be selected for each attack type under investigation.

In the next phase it is expected that the selection of machine learning algorithms can be finalised with a comparable detection rate across all of the attack types, thereby providing consistent reliability for all attack types.

Actual items implemented in CogNet

The original specification described the testbed's use of virtual machines for the generation and monitoring of network traffic. As the investigation proceeds and development matures, optimisations are introduced that not only streamline testbed traffic generation but also have far-reaching benefits for the final deployment of the scenario into a live environment. The modifications that have been implemented across the distributed security enablement testbed are described in the following sections.

2.1.3.1 Docker Containers

Each of the virtualised nodes used for the generation of normal traffic patterns has been migrated into Docker containers. This allows for the rapid deployment of multiple containers within a secure environment, providing the following benefits:

• Elimination of external noise, preventing contamination of the traffic patterns within the captured metric datasets used for monitoring.
• Rapid scaling of nodes within Docker, allowing traffic demand to be increased and decreased.
• A sandboxed environment, preventing the accidental "escape" of attack vectors from the environment under test.

2.1.3.2 Attack Utilities

The Kali server [1] was originally specified as a source of some or all of the attack types that constitute this scenario. Several utilities [2] available in the Kali distribution provide multiple types and variants of attack vectors. The Kali distribution has therefore been incorporated into the distributed security enablement testbed in the form of another Docker container. This container is deployed within the same environment as the normal traffic-generating nodes and can be configured to attack one or more nodes within the sandboxed environment. This flexibility allows multiple instances of the attack container to be deployed, providing various mutations of the attacks.

[1] https://www.kali.org/
[2] https://tools.kali.org/tools-listing

2.1.3.3 DSE Gateway

The distributed security enablement gateway provides a feature extraction service whereby messages received from the network probes in the SFlow [3] format are transformed into the formats and structures defined by the distributed security enablement gateway property file. The extraction application has been upgraded to make use of the open-source utility sflowtool [4]. This change relies on standard tooling and allows rapid integration with the distributed security enablement testbed.
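As an illustration of this transformation step, the following minimal Python sketch parses one decoded FLOW record emitted by sflowtool and maps it to a small per-attack-type feature dictionary. The field labels follow the positions visible in the sample output listed in Section 2.1.8.2 (they are descriptive names, not an official sflowtool schema), and the hard-coded feature lists stand in for the gateway property file:

# Illustrative sketch only; the per-attack-type feature lists are hypothetical
# placeholders for the gateway property file.
FLOW_FIELDS = [
    "agent", "in_port", "out_port", "src_mac", "dst_mac", "eth_type",
    "in_vlan", "out_vlan", "src_ip", "dst_ip", "ip_proto", "ip_tos",
    "ip_ttl", "src_port", "dst_port", "tcp_flags", "packet_size",
    "ip_size", "sampling_rate",
]

ATTACK_FEATURES = {
    "dos": ["src_ip", "dst_ip", "ip_proto", "tcp_flags", "packet_size"],
    "port_scan": ["src_ip", "dst_ip", "dst_port", "tcp_flags"],
}

def parse_flow_line(line):
    # Split a "FLOW,..." line from sflowtool into a field dictionary.
    parts = line.strip().split(",")
    if not parts or parts[0] != "FLOW":
        return None
    return dict(zip(FLOW_FIELDS, parts[1:]))

def extract_features(record, attack_type):
    # Keep only the fields configured for the given attack type.
    wanted = ATTACK_FEATURES[attack_type]
    return {name: record[name] for name in wanted if name in record}

sample = ("FLOW,192.168.0.1,21,23,cee2e8493917,c6cab0d03a55,0x0800,0,0,"
          "192.168.1.1,192.168.1.2,1,0x00,64,0,0,0x00,102,84,64")
print(extract_features(parse_flow_line(sample), "dos"))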

2.1.3.4 NFVI Environment Selection

The OPNFV [5] platform has been selected to act as the Network Function Virtualisation Infrastructure (NFVI) environment. This environment has already been rolled out onto the testbed datacentre infrastructure and facilitates the design, implementation and deployment of the various links in the service function chains.

Testbed Implementation

The distributed security enablement testbed is captured in Figure 2-2.

[3] http://sflow.org/sflow_version_5.txt
[4] https://github.com/sflow/sflowtool
[5] https://www.opnfv.org/

Figure 2-2 Distributed Security Enablement Testbed


The simulated attack environment uses Docker [6] to build up the array of attack types and allows dynamic configuration of the normal network traffic nodes that are to be exposed to the various attack types. It also supports the harvesting of both normal and attack traffic patterns, which facilitates the selection and verification of various machine learning algorithms. These offline data captures additionally allow the algorithms to be benchmarked under different numbers of nodes and load conditions, giving valuable insight into their performance.

2.1.4.1 Monitoring

Monitoring for the testbed is achieved by introducing the SFlow generator. The SFlow probe is inserted into the normal traffic pattern and generates monitoring data covering all of the traffic flows. In the next version of the testbed the SFlow probe will be implemented as a link in the service function chain and will have complete visibility of all of the traffic within that chain. The output from the SFlow generator is forwarded to a dedicated VM whose purpose is to transform the SFlow packets into a usable format and extract the relevant features from the resultant data stream. The extracted features are specific to each attack type under investigation and are forwarded to a Kafka queue dedicated to that attack type.

Figure 2-3 and Figure 2-4 highlight the normal traffic generation between the nodes 192.168.1.1 (server node) and 192.168.1.2 (client node), and the point at which the 192.168.1.3 (rogue) node begins attacking the server (in this case the 192.168.1.1 node). All communication between the client and server is interrupted as the rogue node consumes all of the available bandwidth, thereby simulating a DoS attack.

[6] https://www.docker.com/

Figure 2-3 Traffic generator with attack overlay


As the SFlow packets are transformed, the relevant elements of the datagrams are extracted from the stream and inserted into the appropriate Kafka queue. Figure 2-5 displays the contents of the SFlow packets as received from the SFlow generator.

2.1.4.2 Detection – ML implementation

The machine learning investigation and selection is still ongoing and is expected to be completed in the next few weeks, once sufficient results have been obtained from the models under investigation. As this investigation matures, the selected algorithms will be migrated into the CogNet Common Infrastructure; in particular, the machine learning components will be deployed within the CogNet Smart Engine environment. This environment is specifically designed to accommodate all of the scenarios for CogNet's machine learning algorithms and also allows the input queues used by each CogNet Smart Engine instance to be configured.

Figure 2-4 Highlighted first attack

Figure 2-5 SFlow Packet Contents
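To make the model selection step concrete, the following sketch compares candidate classifiers on a labelled flow-feature dataset using scikit-learn. The feature matrix, labels and candidate set are synthetic placeholders and do not correspond to the actual datasets or models evaluated in the testbed:

# Illustrative model comparison on placeholder data; the real datasets come from
# the Docker traffic/attack environment described above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))               # placeholder flow features
y = (X[:, 0] + X[:, 3] > 1.5).astype(int)    # placeholder attack label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    score = f1_score(y_te, model.predict(X_te))
    print(name, "F1 =", round(score, 3))     # keep the model with the best score

In practice the same comparison would be repeated per attack type, and the best-scoring model for each type would be the one promoted into the CogNet Smart Engine.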

2.1.4.3 Actuation

The output of the CogNet Smart Engine instance for the distributed security enablement scenario is a policy document that specifies the set of corrective actions to be executed (if needed) based on the latest knowledge learned by the system. This policy instructs the Policy Manager component within the CogNet Common Infrastructure how to mitigate the detected situation in the network. The Policy Manager executes the set of corrective actions against the underlying network infrastructure specific to this service provider/testbed, which takes the form of the NFVI environment.
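For illustration only, a minimal sketch of such a policy document is given below as a Python dictionary; the field names (policy_id, trigger, actions and so on) are hypothetical and do not reproduce the schema actually exchanged between the CogNet Smart Engine and the Policy Manager:

# Hypothetical policy document, for illustration only.
import json

example_policy = {
    "policy_id": "dse-dos-mitigation-001",
    "trigger": {
        "detector": "dse_dos_classifier",     # assumed detector name
        "confidence": 0.97,
        "offending_src_ip": "192.168.1.3",
    },
    "actions": [
        {   # re-route suspect flows into a deeper inspection chain
            "type": "reroute_sfc",
            "match": {"src_ip": "192.168.1.3"},
            "target_chain": "analysis_chain_1",
        },
        {   # tighten the edge firewall ACL for the offending source
            "type": "update_acl",
            "firewall": "edge_fw_l3",
            "rule": {"src_ip": "192.168.1.3", "action": "deny"},
        },
    ],
}

print(json.dumps(example_policy, indent=2))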

As stated earlier, the OPNFV platform has been selected as the NFVI environment to facilitate the Service Function Chains. The chains consist of multiple types of elements, including the following:

• Level 3 firewall implementing Access Control Lists
• SFlow probes monitoring network traffic through the chains
• Level 2 firewall determining routing for particular types of traffic traversing the network
• Load balancer/router for forwarding traffic to service provider services
• Stage 1 Deep Packet Inspection for initial intrusion detection
• Stage 2 Application Monitor for in-depth application behaviour analysis

As traffic enters the network, the first point of authentication is 802.1X verification. This is a hard go/no-go decision, based on the records associated with the supplied credentials, determining whether the device is allowed to enter the network from the specified access point. Once in the network, the traffic must traverse one of the pre-provisioned chains, which have additional layers of security built in and also generate the SFlow metric data. If the traffic is regarded as not being a threat, it can proceed as normal into the service provider's core network to access the required services. If any threats are detected, the traffic can be re-routed to another chain for analysis. This analysis could result in an "all clear", in which case normal routing is applied for this traffic signature and the firewall rules are updated. For the most part, actuation as directed by the Policy Manager will take the form of modifications to elements of the existing service function chains, rerouting of traffic to different service function chains, or scaling of additional service function chains in reaction to changing traffic demands.

2.1.4.3.1 Modification to Service Function Chains

In this form of actuation, one or more of the existing elements in the service function chain are modified to reflect the updated knowledge gained by the system. This can take the form of updating firewall rules, modifying access control list entries and/or removing the device's 802.1X record from the access point server.
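As a purely illustrative sketch of such a modification, the snippet below translates a deny-source action from the hypothetical policy document sketched in the Actuation section into an iptables rule on the affected firewall element; it is not the actual Policy Manager implementation and uses only standard iptables options:

# Illustrative only: applying a deny-source policy action as an iptables rule.
import subprocess

def apply_deny_rule(action):
    # Translate a hypothetical "update_acl" deny action into an iptables FORWARD rule.
    rule = action["rule"]
    if rule.get("action") != "deny":
        return
    cmd = ["sudo", "iptables", "-A", "FORWARD", "-s", rule["src_ip"], "-j", "DROP"]
    subprocess.check_call(cmd)

apply_deny_rule({
    "type": "update_acl",
    "firewall": "edge_fw_l3",
    "rule": {"src_ip": "192.168.1.3", "action": "deny"},
})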

2.1.4.3.2 Provision of Service Function Chains

In this form of actuation the required modification may be to scale the chains that are available to the firewalls and load balancers in order to handle not only an increase in normal traffic but also an increase in suspect traffic. A situation may occur where the overall traffic load remains stable but the amount of suspect traffic requiring additional analysis has increased exponentially, as in the case of infected nodes infecting new nodes. In this situation additional chains can be provisioned and the firewall rules updated so that particular flows are directed to the new inspection zones. The inverse also holds: the same methodology can be used to de-provision chains when demand decreases. Figure 2-6 displays the initial configuration of the service function chain for a system where no significant amount of suspect traffic has been identified and a default analysis chain is provisioned so that traffic inspection can be carried out without delay.

Figure 2-6 Service Function Chains in the edge security zone

Initial Experimentation results

Only initial measurements were executed for this testbed. The comprehensive evaluation of the testbed proof-of-concept and of the machine learning techniques deployed will be presented in detail in D5.4.

Experience

Following the procedure described in Section 4.1, the table below outlines the identified sources of experience assisting the ML in this use case, together with a ranking of the importance and the difficulty level of each source.

Monitoring

Source of experience | Description | Importance | Difficulty
Time | Lack of sufficient monitoring data due to incorrect or insufficient sampling frequency, e.g. SFlow | 9 | 5
Task | Stale or misconfigured monitoring module | 6 | 2
Context | Lack of monitoring capacity, resource limitations or external monitoring conflicts | 4 | 3

Detection

Source of experience | Description | Importance | Difficulty
Time | High latency of detection due to dynamic traffic patterns, increased metrics, incorrect algorithms | 9 | 9
Task | Stale or misconfigured detection module | 9 | 6
Context | Conflicts with other detection algorithms, cross talk or cross contamination of the monitored infrastructure | 8 | 6

Actuation

Source of experience | Description | Importance | Difficulty
Time | Failure to invoke the needed action (policy) in a timely manner (slow invocation, seconds from monitoring to actuation) | 8 | 6
Task | Invocation of the most appropriate action (policy) | 8 | 7
Context | Policy conflicts, multiple actuations on the same infrastructure with opposing goals | 6 | 5

Table 2-1 Rank for different sources of experience

Conclusions and possible next steps

From the progress made to date and the results seen so far, it is clear that the machine learning aspects require additional analysis to determine the most appropriate algorithms to be used in the detection of the various attack types. As knowledge around the detection increases, the simulation environment can be scaled up in order to examine the performance of the detection and actuation modules and how they behave under the increased load.

Migration of the testbed to the NFVI environment is scheduled; all of the various components must then be implemented as elements within the service function chains, and all traffic generators (servers and clients) must reside in the appropriate network locations. Once sufficient and appropriate results have been obtained from the testbed, there are plans to introduce the distributed security enablement platform into a corporate environment where live data can be streamed into the testbed. This will provide a better understanding of the capability of the platform under realistic operating conditions.


User Manual

There are several steps required for the setup of the distributed security enablement testbed; some of the modules are specific to this testbed and some are common to the CogNet Common Infrastructure. Figure 2-7 shows which components are specific to the distributed security enablement testbed.

Figure 2-7 Distributed Security Enablement specific components

2.1.8.1 Attack Simulation Environment

The dependencies for this environment are:

• Docker
• Open vSwitch
• Python (within the Docker containers)

Provision a virtual machine and install Docker using the following steps:

sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo apt-key fingerprint 0EBFCD88
sudo add-apt-repository \
    "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
    $(lsb_release -cs) \
    stable"
sudo apt-get update
sudo apt-get install docker-ce

Next, install Open vSwitch:

sudo apt update
sudo apt install openvswitch-switch

Add a new Dockerfile that is used to configure each of the traffic-type containers:

FROM ubuntu:16.04
RUN apt-get update
RUN apt-get install net-tools -y
RUN apt-get install iputils-ping -y
RUN apt-get install tcpdump -y
RUN apt-get install wget -y
RUN apt-get install nano -y
RUN apt-get install python2.7 -y

The Docker containers are now ready to be deployed. The following example starts two initial containers that communicate with each other to generate normal traffic patterns; the Open vSwitch bridge will output the SFlow packets based on the network traffic:

sudo docker run -td --name=server01 --net=none DOCKER_IMAGE
sudo docker run -td --name=client01 --net=none DOCKER_IMAGE
export pubintf=ens3
export privateintf=ovs-br1
sudo iptables -t nat -A POSTROUTING -o $pubintf -j MASQUERADE
sudo iptables -A FORWARD -i $privateintf -j ACCEPT
sudo iptables -A FORWARD -i $privateintf -o $pubintf -m state --state RELATED,ESTABLISHED -j ACCEPT
sudo iptables -S
sudo ovs-docker add-port ovs-br1 ens3 server01 --ipaddress=192.168.1.1/16 --gateway=192.168.0.1
sudo ovs-docker add-port ovs-br1 ens3 client01 --ipaddress=192.168.1.2/16 --gateway=192.168.0.1
sudo ovs-vsctl list-ports ovs-br1
sudo ovs-vsctl -- --id=@sflow1 create sFlow agent=ovs-br1 target=\"FEATURE_EXT_VM_IP:6343\" header=128 sampling=64 polling=10 -- set Bridge ovs-br1 sflow=@sflow1

The server container is configured as a file server and the client container is configured to simulate network requests by continuously requesting files from the server and executing pings. Additional containers and more complexity can easily be added to the environment. The network traffic simulator environment is now ready for use.
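As an illustration of the client behaviour described above, the following minimal Python sketch (hypothetical, not the actual container script) repeatedly fetches a file from the server node and issues pings, which is enough to produce a steady background traffic pattern. It uses wget and ping, both installed in the traffic-container Dockerfile above, and the HTTP port and file name are assumptions:

# Hypothetical normal-traffic client loop, for illustration only.
# Python 2.7 compatible; relies on wget and ping from the container image.
import subprocess
import time

SERVER = "192.168.1.1"
FILE_URL = "http://%s:8000/testfile.bin" % SERVER   # assumed file-server endpoint

while True:
    # Download a file from the server to generate TCP traffic (discard the output).
    subprocess.call(["wget", "-q", "-O", "/dev/null", FILE_URL])
    # Issue a few ICMP echo requests to generate ping traffic.
    subprocess.call(["ping", "-c", "3", SERVER])
    time.sleep(1)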

To create an attack node, another container is created based on the Kali server. The utility used in this example is the hping3 [7] application. Create the Dockerfile for this new type of container:

FROM kalilinux/kali-linux-docker
RUN apt-get update
RUN apt-get install hping3 -y

[7] https://tools.kali.org/information-gathering/hping3

Add the new container to the Open vSwitch configuration:

sudo docker run -td --name=kalicontainer --net=none Kali_DOCKER_IMAGE
sudo ovs-docker add-port ovs-br1 ens3 kalicontainer --ipaddress=192.168.1.3/16 --gateway=192.168.0.1

Bash into the container and execute the attack command (this is shown here as a manual step for example purposes):

sudo docker exec -it kalicontainer bash
hping3 -S --flood -V 192.168.1.1

The container now continuously floods the server container with TCP SYN requests, thereby blocking other nodes from accessing the server.

2.1.8.2 Monitoring and Transformation Environment

This environment can be run on a virtual machine or within another Docker container. For the distributed security enablement testbed this environment is deployed within a Docker container on a separate Docker server, in order to preserve the separation between generation, monitoring, analysis and actuation. The dependencies for this environment are:

• Docker
• Python (within the Docker containers)

Install Docker and provision a container. In this container, execute the following to download, build and install sflowtool (changing into the cloned directory before building):

git clone https://github.com/sflow/sflowtool.git
cd sflowtool
./boot.sh
./configure
make
sudo make install

Once built, the container is set to run; it will listen on a pre-configured port for SFlow traffic from the traffic generation server.

sudo docker run -p 6343:6343/udp sflow/sflowtool -l | python2.7 kafkaProducer.py

All SFlow traffic detected is then processed by this container and the resulting data stream is output to the specified Kafka queues.

An example of the output can be seen here:

FLOW,192.168.0.1,21,23,cee2e8493917,c6cab0d03a55,0x0800,0,0,192.168.1.1,192.168.1.2,1,0x00,64,0,0,0x00,102,84,64
CNTR,192.168.0.1,5,6,100000000,0,3,2700,63,0,4294967295,0,0,4294967295,648,8,4294967295,4294967295,0,0,0
FLOW,192.168.0.1,21,23,cee2e8493917,c6cab0d03a55,0x0800,0,0,192.168.1.1,192.168.1.2,1,0x00,64,0,0,0x00,102,84,64
CNTR,192.168.0.1,23,6,10000000000,1,3,31947846679,482569267,0,4294967295,0,0,4294967295,4415057198292,536304036,4294967295,4294967295,0,0,0
CNTR,192.168.0.1,21,6,10000000000,1,3,4415216794788,537929666,0,4294967295,0,0,4294967295,32107445273,484194924,4294967295,4294967295,0,0,0
FLOW,192.168.0.1,21,23,cee2e8493917,c6cab0d03a55,0x0800,0,0,192.168.1.1,192.168.1.2,1,0x00,64,0,0,0x00,102,84,64
FLOW,192.168.0.1,21,23,cee2e8493917,c6cab0d03a55,0x0800,0,0,192.168.1.1,192.168.1.2,1,0x00,64,0,0,0x00,102,84,64
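The kafkaProducer.py script referenced above is not listed in this deliverable; the following minimal sketch, using the kafka-python client, illustrates how such a script could read the decoded records from stdin and publish FLOW and CNTR samples to separate Kafka topics. The topic names and broker address are assumptions:

# Minimal sketch of a kafkaProducer.py-style forwarder; illustrative only.
# Assumes a Kafka broker at localhost:9092 and the kafka-python package.
import sys
from kafka import KafkaProducer

TOPICS = {"FLOW": "dse_flow_samples", "CNTR": "dse_counter_samples"}  # assumed names

producer = KafkaProducer(bootstrap_servers="localhost:9092")

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    record_type = line.split(",", 1)[0]        # "FLOW" or "CNTR"
    topic = TOPICS.get(record_type)
    if topic is None:
        continue                                # ignore other sflowtool record types
    producer.send(topic, line.encode("utf-8"))  # publish the raw CSV record

producer.flush()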

2.1.8.3 Machine Learning and Policy Recommender Environments

The output data is forwarded to the CogNet Smart Engine, the analysis module within the CogNet Common Infrastructure. The machine learning algorithms specific to the distributed security enablement are implemented and deployed in the Smart Engine, and its output is forwarded to the Policy Manager, also within the CogNet Common Infrastructure. The Policy Manager executes the required actions on the network infrastructure where the attack has been detected; in this case this translates into actions carried out in the NFVI environment (OPNFV).

2.1.8.4 NFVI Environment

The network infrastructure is simulated using an NFVI environment in which (and alongside which) the traffic generators and attack nodes are deployed. As attacks are detected, the CogNet system will identify and recommend actions to be executed to fix the problems. This actuation is executed on the NFVI environment itself, which in the case of the distributed security enablement testbed is the OPNFV platform. This environment was configured using the Fuel installer; one of the installation steps is detailed in Figure 2-8.

Figure 2-8 Fuel Installer Screen


Once the Fuel installation has been completed, the system is ready to install the OPNFV environment onto servers in the datacentre. The size of the platform to be built is user-selectable, and once all of the configuration items have been identified the platform can be installed. Example install screens are displayed in Figure 2-9 and Figure 2-10.

Figure 2-9 Hardware selection for OPNFV Environment Installation


Once the install is complete, the dashboard provides information about the health of the system as well as access to both the Horizon dashboard8 (for OpenStack access to the platform) and the OpenDaylight9 dashboard (for Software Defined Networking access to the platform).

8 https://docs.openstack.org/horizon/latest/ 9 https://www.opendaylight.org/

Figure 2-10 Install of the OPNFV for the DSE Testbed

Figure 2-11 OPNFV for the DSE Testbed successfully installed


With the platform installed, the testing environment needs to be configured and deployed. As explained earlier, the Service Function Chains will all be deployed within the OPNFV environment, and it is from here that the network traffic generators and the attack modules will also be deployed. To close the loop, the actuation required to remedy detected attack situations will also be executed within this OPNFV environment.


2.2. Honeynet Testbed

Scope

The Honeynet testbed was introduced in D5.2 as one of the potential applications of NFV to network threat detection supported by machine learning technologies. The scope of this testbed, as a subset of the distributed security enablement, is to generate useful datasets, as close as possible to reality, without compromising privacy, and to test different types of detection algorithms based only on the network traffic flows.

The initial scope defined in deliverable D5.2 covered the identification and classification of security attack patterns in the data plane, especially in relation to encrypted traffic; that is, inspecting data packets from Layer 2 to Layer 4 and avoiding payload analysis (encrypted or not), thus improving privacy.

Short description (update of D5.2)

As mentioned in D5.2, the central proposal of this testbed is to use Telefonica’s Mouseworld Lab for the specific use case of security, as opposed to the WP4 use case, which focused on the classification of HTTP traffic patterns.

In this release, the main activities have focused on setting up endpoint nodes and injection tools and on producing synthetic traffic patterns as a source of netflow version 9 datasets, as well as testing initial ML algorithms for evaluation. These patterns include normal traffic, some malware traffic samples and the honeypot-related traffic. The dataset obtained has been used to apply unsupervised algorithms to detect anomalies related to security. The results obtained give a hint of the usability of these types of algorithms.

Actual items implemented in CogNet

A set of assets and resources have been introduced in the Mouseworld lab to be used in these experiments. We list them organized by the functionality they offer.

2.2.3.1 Network traffic generator

Normal traffic, understood as common ISP traffic (including browsing traffic, video, cloud files, network testing, speed tests, DNS, software updates, etc.): this traffic has been used as the white noise expected in any ISP network, into which the malicious traffic is introduced.

Generation of malicious traffic (attackers based on malware traffic captures): based on a client-server architecture or on pcap re-injection tools in the network (tcpreplay), several interesting samples have been used for testing the ML algorithms:

10 https://www.ietf.org/rfc/rfc3954.txt


o Malspam: generates suspicious FTP traffic.

o Ransomware pseudo-darkleech11: generates UDP traffic towards sequential IP addresses using the same source and destination port.

o Exploit kit rig ek12: UDP and TLS traffic.

o Ransomware variant locky13: high data volume per TCP session.

Generation of security attacks using the Honeynet capacity. Currently, two types of synthetic traffic have been produced: brute force and web application attacks:

o Brute force attacks against SSH (can be extended to other protocols [1]): the network traffic to be captured has been generated by a set of clients using security pentesting tools widely used by hackers and “script kiddies”. Hydra [2], medusa [3] and ncrack [4] were analysed and the last one was selected for its capacity to change the behaviour of the attack (aggressiveness, simultaneous sessions, timeouts, retries, etc.). On the server side, we implemented the cowrie [5] honeypot VNF, the evolution of Kippo14.

o Web application attacks: in this case the network traffic is based on HTTP and can be encrypted or not. The aim of the experiment is to generate malicious payloads (such as URLs with application exploits, XSS, RFI, SQL injection, etc.). The client side has been implemented with specific scripts based on the ENISA honeypot training exercise [6] that produce multiple malicious URL requests based on well-known attack patterns (e.g. http://example.com/../../../etc/passwd). The server side is served by the Glastopf15 honeypot.

2.2.3.2 Traffic capture and transformation

Network switches enabled with traffic mirroring.

A VNF “probe” in charge of traffic capture and Netflow version 9 generation, supporting any kind of IP network traffic. This implementation is set up as an alternative to the tstat format used in WP4 for traffic classification in the Mouseworld Lab.

Netflow capture, storage and transformation into CSV format for machine learning training and validation.

11 https://researchcenter.paloaltonetworks.com/2016/12/unit42-campaign-evolution-pseudo-darkleech-2016/ 12 https://threatpost.com/inside-the-rig-exploit-kit/121805/ 13 https://nakedsecurity.sophos.com/2016/02/17/locky-ransomware-what-you-need-to-know/ 14 https://github.com/desaster/kippo 15 https://github.com/mushorg/glastopf


Testbed Implementation

Figure 2-12 shows the Honeynet testbed, implemented as an expansion of the Mouseworld lab to generate malicious synthetic traffic and to provide monitoring capacity through a dashboard.

Figure 2-12 Honeynet integration in Mouseworld

2.2.4.1 Monitoring

In order to be able to monitor the behaviour, a netflow version 9 probe has been implemented. The probe, running as a VNF, collects traffic from the mirror port and generates the netflow dataset used as input for machine learning. This information is used not only to produce the data, but also to monitor and visualize the experiments during training. The monitoring tool is based on ELK [7] and allows the training process to be observed. As an example, Figure 2-13 shows two samples of generated test traffic: in Figure 2-13 (a) an SSH brute force attack is generated standalone, while in Figure 2-13 (b) the same attack is injected mixed with the normal browsing traffic.


Figure 2-13 Traffic patterns examples

2.2.4.2 Detection – ML implementation

During this period, the detection phase has been developed around the Isolation Forest algorithm [8], an anomaly detection technique that behaves well with multi-dimensional features. Table 2-2 shows the set of netflow attributes initially involved in the ML:

ts    Flow start timestamp
te    Flow end timestamp
td    Duration (sec)
sa    Source address
da    Destination address
sp    Source port
dp    Destination port
pr    Protocol
flg   Flags
ipkt  Input packets
ibyt  Input bytes

Table 2-2 Set of netflow attributes


Several actions have been evaluated, such as flow grouping, normalization and feature transformation, with the objective of reducing the number of features and flows. A subset of the global flows has been used for the training phase, without discriminating between normal and abnormal traffic.
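As an illustration of this unsupervised step, the following Python sketch trains an Isolation Forest on a CSV netflow export such as the one produced in Section 2.2.8.2. It is not the project’s honeypot.py implementation: the selected columns, the sampling fraction and the contamination value are assumptions made for the example.

import pandas as pd
from sklearn.ensemble import IsolationForest

# Load the netflow dataset exported by nfdump in CSV format.
flows = pd.read_csv('flow_malware.csv')

# Simplified numeric subset of the attributes from Table 2-2 (assumption:
# timestamps and IP addresses are dropped or transformed beforehand).
features = flows[['td', 'sp', 'dp', 'ipkt', 'ibyt']].fillna(0)

# Train on a subset of the flows, without separating normal and abnormal traffic.
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
model.fit(features.sample(frac=0.5, random_state=42))

# Score every flow: the lower (more negative) the score, the stronger the anomaly.
flows['score'] = model.decision_function(features)
print(flows.sort_values('score')[['ts', 'sa', 'score']].head(20))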

2.2.4.3 Actuation

Once a malicious attack is detected, several actions can be executed. A clear case is the mitigation model described in Section 2.3, where new firewall policies are generated for the detected data traffic with the following degrees of execution:

• ACCEPT – the flows which are considered not malicious are accepted
• DROP – the malicious flows are dropped
• QUARANTINE – send the flow to a specific security zone where it is processed in a special manner to determine whether it is an attack or a usage anomaly – a new action introduced by CogNet
• HONEYNET – send the flow to a Honeynet network similar to the one presented in this section in order to determine new attack signatures – the proof-of-concept implementation on how to handle such data flows is presented in this section.

Initial Experimentation results

The set of experiments initially executed is the following:

• Malware detection of malicious pcap re-injection mixed with normal traffic. This experiment started with 320,000 flows and 48 netflow attributes. Optimization attempts in the algorithm reduced the dataset to 67,000 flows and 38 features.

• SSH and web application attacks from clients against the Honeynet, with normal traffic added. The data collected during 12 hours comprises a dataset of 200,000 flows and 48 netflow attributes. The algorithm reduced the dataset to 100,000 flows and focused on 9 features.

Table 2-3 shows an example of the anomaly output produced by the Isolation Forest, identifying the source address (sa) of the offending client.

ts             sa            score
19/5/17 22:11  172.16.1.206  -0.20580915652274578
19/5/17 22:51  172.16.1.205  -0.2021207121578934
19/5/17 22:51  172.16.1.210  -0.2172456088410626
19/5/17 23:15  172.16.1.203  -0.20761057896409707
19/5/17 23:26  172.16.1.203  -0.20521293821550446
19/5/17 23:53  172.16.1.199  -0.2128848724876481
20/5/17 0:18   172.16.1.206  -0.21755367957227378
20/5/17 0:48   172.16.1.206  -0.20188158358838482
20/5/17 1:15   172.16.1.230  -0.2024498430432281
20/5/17 2:15   172.16.1.203  -0.20557060873537303
20/5/17 2:28   172.16.1.207  -0.21743241073746933


Table 2-3 Anomalies output detected by the isolation forest identifying source address of offending client.

The initial results obtained in all cases, working with multiple features and large datasets, show no acceptable precision (below 60% in most cases). Additional re-engineering of the algorithm parameters and features is in progress to improve precision.

Experience

Following the procedure described in Section 4.1, the following outlines the identified sources of experience assisting the ML in this use case, together with a ranking of the importance and the difficulty level of these sources.

Monitoring:

• Time – Lack of monitoring data or wrong time granularity of monitoring data (e.g. netflow sampling). Importance: 5, Difficulty: 9
• Task – Stale or misconfigured monitoring module(s). Importance: 9, Difficulty: 1
• Context – Conflicts with other monitoring tasks. Importance: 1, Difficulty: 5

Detection:

• Time – High latency of detection caused by flow lifetime (can be several hours). Importance: 8, Difficulty: 6
• Task – Stale or misconfigured detection module(s). Importance: 9, Difficulty: 9
• Context – Conflicts with other detection algorithms, lack of detection capacity (no accurate algorithm). Importance: 8, Difficulty: 5

Table 2-4 Rank for different sources of experience

Conclusions and possible next steps

Due to the limited results obtained, the experiments are not conclusive and additional tests must be performed with the Isolation Forest algorithm, varying some of the parameters such as the estimators and contamination parameters in order to further evaluate the algorithm.

Regarding the possible next steps, the plan includes combining real traffic originating from internet honeypots with the traffic patterns captured in the Mouseworld in order to re-train the algorithms with new threats and attacks.

Additional mid-term plans include the evaluation of supervised algorithms, generating new datasets and carrying out the labelling process, with the aim of comparing the results against those obtained in this unsupervised experimentation phase. There are also plans to expand the types of malicious traffic with new families using commercial security application traffic generators, such as the IXIA Breaking Point16 product, which is in the short-term roadmap of the Mouseworld Lab.

User Manual

2.2.8.1 Traffic Generation

Generation of malicious traffic by means of re-injecting malware traffic captures.

The re-injection process is based on the tcpreplay17 tool, which allows previously captured network traffic to be replayed inside the network. The following command replays a capture file (malware.pcap) into the Mouseworld through the ens6 network interface:

$ sudo tcpreplay -i ens6 malware.pcap

As a result, the Netflow probe will see the complete traffic session as normal traffic happening in real time.

Generation of a brute force attack against Honeynet

The following command generates attacks against the SSH honeypot cowrie server (172.16.1.105) from one Mouseworld client using the ncrack tool and a password dictionary (pass_wordlist):

$ sudo ncrack -p 22 -user root -P ../crunch-3.6/pass_wordlist 172.16.1.105 -vvv -T5

Generation of web application attacks against Honeynet

In order to generate realistic attacks against web servers, we produce variations of malicious URLs against the Glastopf honeypot (172.16.1.106), using ENISA honeypot training exercise 3.2 available in the ENISA virtual machine:

/opt/exercise3.2 172.16.1.106

2.2.8.2 Traffic capture and dataset generation (probe VNF)

1. Install applications and dependencies:

$ sudo apt-get install softflowd nfdump

2. Generate netflow version 9 records from the mirrored traffic received by the “probe” on the ens6 interface. The data is duplicated to the Kibana visualization tool and stored locally to generate the dataset:

$ sudo softflowd -i ens6 -n 192.168.159.153:9995 -v 9 -t tcp=60s
$ sudo softflowd -i ens6 -n 127.0.0.1:9995

3. Capture the netflow and store it:

$ sudo nfcapd -w -D -p 9995 -z -t 120 -l /home/cognet/netflow/source1

16 https://www.ixiacom.com/products/breakingpoint 17 http://tcpreplay.synfin.net/


4. Export the result to CSV format:

$ nfdump -R /home/cognet/netflow/source1 -o csv > /home/probe/flow_malware.csv

2.2.8.3 ML algorithms validation.

The code can be downloaded from:

$ git clone https://github.com/CogNet-5GPPP/WP5-CSE-Final.git
$ cd ./WP5-CSE-Final/MouseWorld

Prerequisites

The following Python 2.7 libraries are required:

• numpy

• pandas

• scikit-learn (sklearn)

• matplotlib

Dataset

Synthetic netflow traffic has been used to train and parameterize the algorithm. The Mouseworld lab generates more than 500K flows to use with this algorithm. A sample dataset file (15K flows), "dataset_test.csv", is included as an example to test the algorithm. The dataset includes normal Mouseworld browsing traffic and 4 flows of known malicious trojan traffic18. In the results, some of these flows appear among the first anomalies.

Execution

In order to execute the code, place a dataset file in CSV format in the same directory and run:

$ python ./honeypot.py

The dataset processed with the different techniques is stored as output in a new file called "dataset_processed.csv". The anomaly identification results are stored in a new file called "results.csv", and the anomaly list is also written to standard output.

2.3. NFV Security Anomaly Detection Testbed

Scope

The scope of the testbed is to provide a proof of concept of how dynamically distributed security mechanisms can be embedded directly into the NFVI. Through this it shows that Security as a Service (SECaaS) can be offered by cloud providers as part of their NFV offering, while at the same time being able to adapt dynamically to the new types of attacks that may appear in the tenant networks.

18 http://www.malware-traffic-analysis.net/2017/06/21/index.html

Specifically, the SECaaS is composed of a set of firewall components which are distributed throughout the tenant network. The firewall components provide secure connectivity for the devices in the tenant network, practically acting as a distributed firewall included in the tenant network and under the control of the infrastructure provider.

In this context, machine learning plays a major role in dynamically determining new types of attacks (i.e. anomalies in usage behaviour), for which new security policies can be immediately installed. This makes it possible to protect tenant networks (and ultimately also the virtual network infrastructure) from security vulnerabilities that are specific to that tenant network and potentially unknown to the infrastructure provider. This type of functionality cannot be provided with current policy-based dynamic firewalls, as the malicious behaviour is highly dependent on the specifics of the tenant network, and thus the detection should be trained on the specific deployment usage.

Figure 2-14 - Adding Security with ML in the Cloud Infrastructure

Testbed Description (update of D5.2)

The NFV Security Testbed, as introduced in deliverable D5.2, integrates the OpenSDNCore into OpenStack to enable customizable, live network traffic monitoring and to provide complete control over traffic flows, including flow relocation, flow classification and firewall rule enforcement to block or redirect malicious traffic or to introduce rate limitation on certain flows.

This release mainly focuses on the further development of data acquisition and policy actuation capabilities of the testbed using machine learning algorithms for policy selection and to detect unknown attacks by detecting unusual, anomalous traffic patterns, based on the data obtained by the implemented monitoring functionality.


The machine learning anomaly detection functionality was implemented in the form of a very basic proof-of-concept neural network, as presented in Section 2.3.2.2, as it is foreseen that for the final evaluation the same machine learning mechanisms as for determining resilience-related anomalies will be used (as described in Section 3.1).

The testbed architecture is illustrated in Figure 2-15. In addition to deliverable D5.2, special attention is given to the extension of the security management functionality within the existing network management architecture and its adaptation to the NFV environment, as a proof of concept that NFVI-level security can be used for the protection of the tenant networks. The proposed extended security management functionality needs to be added to the end-to-end system in order to support the required security as part of the NFVI management functionality, and thus will work as a supporting service transparent to the NFV orchestration of the tenant networks.

Figure 2-15 NFV security testbed

To be able to provide security at NFVI level, the testbed was built considering the following characteristics, which make the usage of existing firewalls obsolete.

• Network topology is highly dynamic – from the NFVI perspective, the topology of the network is continuously changing as new tenant networks are deployed or terminated, as well as due to the dynamic scheduling of the cloud resources (the compute scheduling makes a large number of migration decisions in order to optimize resource usage, resulting in an equivalent number of network topology changes as seen from the NFVI perspective).

• Centralized threat detection – the NFVI system is monitored in a centralized fashion in order to properly manage the resources and the tenant networks, and the same monitoring information can be used to detect threats. Additionally, any forwarding entity at the NFVI level monitors the data traffic and can dynamically introduce new routing and security rules, so all these entities could be used as dynamic firewalls. However, careful attention should be given in the next deployment phases to selecting only specific forwarding entities in order to maintain a proper scalability of the system.

• Dynamic actuation – an extended policy engine can provide numerous methods of actuation in case of attacks, such as initiating new security zones (possibly by interacting with the MANO functionality), isolating parts of the network when attacked and rebooting the virtual machines holding protected information (this is a very powerful means of reconfiguration; others may be run on the fly, like immediate reconfiguration, however the risk of not eliminating the attack would be high). The main functionality considered is the filtering of the data flows depending on the trust the NFVI provider places in them, distinguishing trusted flows and possible attacks, each with different actions (i.e. ACCEPT, QUARANTINE, HONEYNET or DROP).

The main advantage of this security solution integrated within the NFVI is that the cloud provider is able to secure all the tenant networks regardless of whether they have their own security mechanisms or not, thereby drastically reducing the risk of attacks due to vulnerabilities in tenant networks.

Some components have already been implemented and described in the previous D5.2 deliverable (the basic packet core and the monitoring and visualization system). Only the updated components with additional features are described here.

• Policy Engine – the role of the policy engine is to trigger actions relating to changes in data path entities with regard to specific data flows, application flows or subscribers. The policy engine is responsible for the modification of the data path processing related to the security of the system (currently such processing is used only for routing, as security is not a feature of the existing NFVI deployments). For this, the policy engine interacts directly with the firewall components, modifying their processing rules, including which flows are to be accepted, dropped or quarantined. It also manages the adaptation of data flow routing to the different security zones based on the service topology.

With the advanced monitoring system, the different sources of monitored data can be aggregated as input into the CSE; the same is true for the series of actions triggered by the centralized policy engine across the system in response to a specific event. A single policy engine (point of decision making) handling multiple operations as a centralized logical entity, independent of the firewall entities, makes it possible to maintain better synchronization across the system as well as to consider all the conditions of the system, especially the virtual network topology, when selecting an action. The main scalability issues of such a solution could be circumvented by developing a proper scheduling of the security functionality within the NFVI networking components, selecting only the needed components in a specific situation; however, this is a data centre networking development feature and not within the scope of CogNet.

• SDN Controller: in order for the policy engine to communicate with the firewall components in a synchronized manner, an SDN controller is added to the infrastructure, whose purpose is to translate the different actions coming from the policy engine into the specific configuration language understood by the firewall. The SDN controller, as presented in D5.2, acts as an add-on to the SDN functionality within OpenStack, thereby being able to provide security-related added value to the existing routing system.

2.3.2.1 Monitoring

The monitoring metrics information is acquired by the monitoring server from the agents which are located in each of the virtual switches. All the metrics data are stored in a centralized database.


Also, the monitoring server is able to transmit simple threshold-based events to other entities of the system, acting as basic alarms.

For the specific implementation, the monitoring agents are placed within the OpenSDNCore OpenFlow Switch (OFS). In the OFS the monitoring takes place at the flow level, uniquely differentiating between network flows of different protocols, sources and destinations. Throughput and protocol-specific metrics are calculated and stored in the centralized database, available for inspection by the detection entity.

The time-series-based monitoring system has the advantage that it stores historical information, so that the normal behaviour of the system can be predicted and complex analysis can be performed on this information, as for most of the machine learning testbeds described in this deliverable.

As the purpose of the current system is to determine anomalies in the usage of the tenant networks, the monitored metrics relate to the aggregated number of data flows (e.g. number of flows, number of new flows, number of terminated flows) and to the aggregated data traffic (e.g. aggregated data traffic, average data traffic per flow). With this, different types of Denial of Service attacks can be determined, as well as possible probing of system vulnerabilities through overloading of the system.

2.3.2.2 Detection – ML implementation

A very simple and not very accurate machine learning mechanism was developed in order to be able to integrate an end-to-end system for the proof of concept of NFVI-level security. In order to provide a proper evaluation of the system, this simple machine learning mechanism will be replaced during the evaluation phase with the Anomaly Detection Engine (ADE), as described in the resilience testbed in Section 3.1.

Figure 2-16 Example of anomaly detection

An example of anomaly detection is shown in Figure 2-16. The CPU usage metric was monitored on one of the software components deployed on top of the cloud infrastructure. A neural network with one hidden layer of 25 neurons was deployed and trained on 18,000 historical data points. The system is capable of predicting the behaviour of the observed metric one second in advance. The anomaly detection compares the system behaviour (blue) against the predicted normal behaviour (red). When the difference between them (indicated in yellow) is larger than a specified threshold (light green), the system reports the detection of an anomaly in the behaviour of that component.
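The sketch below illustrates the general idea of this proof-of-concept detector in Python: a one-hidden-layer network is trained to predict the next value of a metric from a short history window, and a fixed threshold on the prediction error flags anomalies. It is not the testbed code; the window length, threshold and input file names are assumptions made for the example.

import numpy as np
from sklearn.neural_network import MLPRegressor

WINDOW = 10        # past samples used to predict the next one (assumption)
THRESHOLD = 0.15   # prediction error above which a point is flagged (assumption)

def make_windows(series, window):
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = np.array([series[i + window] for i in range(len(series) - window)])
    return X, y

history = np.loadtxt('cpu_usage_train.txt')   # normal behaviour (training data)
live = np.loadtxt('cpu_usage_live.txt')       # observed behaviour to check

X_train, y_train = make_windows(history, WINDOW)
model = MLPRegressor(hidden_layer_sizes=(25,), max_iter=500)
model.fit(X_train, y_train)

X_live, y_live = make_windows(live, WINDOW)
errors = np.abs(model.predict(X_live) - y_live)
for t in np.where(errors > THRESHOLD)[0]:
    print('possible anomaly at sample %d (error %.3f)' % (t + WINDOW, errors[t]))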

The detection of an anomaly is treated as a possible attack, considering that the input metrics are security-specific, such as the number of active data flows, the size of flows, the size of new flows, etc. The anomalies can occur for multiple reasons:

• Malfunctioning of the software system: the anomaly indicates that the component may fail in the near future. Thus, it can be addressed by high availability and resilience mechanisms.

• Long term performance degradation: the anomaly indicates that a software component is misbehaving due to its usage for an extended period of time. Software components leave specific residues in CPU usage, memory, network or storage, which accumulate when the component is used for a long time and result in performance degradation. The simple solution is to restart the component to bring it back to its initial performance level.

• Sudden increase of the load: it may be triggered by a modification of the users’ behaviour which cannot be predicted; the system therefore treats it as an attack, and administrator intervention is required to bring it back to normality.

• New attacks: non-deterministic and hard to predict.

• False positive detection: the system may behave normally, but the anomaly detection tool indicates the occurrence of an anomaly.

In order to distinguish between new unknown attacks, false positives and other types of anomalies, an anomaly classification process has to be executed. A correlation of anomaly patterns can be performed once the system has passed through at least one thousand anomalies. The detected anomalies are then grouped into different classes. The system administrator determines whether each group represents a real anomaly or a false positive (with at least 100 entries per anomaly class). The more the system administrator interacts with the system, the more experience the system is able to accumulate, which shortens the training duration for anomaly classification. A large number of false positives were determined in the first phase.

However, with the extraction of the different categories into known false positive classes, the result is that only the new anomalies, including new attacks, will be detected. The firewall processing is common for all the services, and the anomaly classes, as well as the known attack signatures, can be imported across different tenants. This reduces the training duration for anomaly classification.
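As a rough illustration of the grouping step described above (not part of the current implementation), detected anomalies could be clustered so that the administrator labels groups rather than individual events; the feature file and the number of groups are assumptions made for the example.

import numpy as np
from sklearn.cluster import KMeans

# Each row is a feature vector describing one detected anomaly, e.g. prediction
# error statistics and the flow metrics around the event (assumed input file).
anomaly_vectors = np.loadtxt('anomaly_features.csv', delimiter=',')

# Group the anomalies into a small number of candidate classes.
grouping = KMeans(n_clusters=5, random_state=0).fit(anomaly_vectors)

# The administrator inspects one group at a time and labels it as a known
# false positive, a known anomaly type or a new attack.
for label in range(5):
    members = np.where(grouping.labels_ == label)[0]
    print('group %d: %d anomalies, e.g. indices %s' % (label, len(members), members[:5]))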

As there is very limited knowledge on what the logic of the final system will be, and as a huge amount of data is needed (e.g. at least 1,000 labelled anomalies of each type), only initial steps towards fault classification will be made as part of the evaluation phase.


2.3.2.3 Actuation

Due to the multiple actions that have to be carried out for maintaining a functioning state of the system, the mitigation process becomes complex for anomalies as well as for the protection of the system against known threats.

The testbed assumes that a two-step actuation should be put in place in real systems. In the first stage, the flows are checked against known attack signatures and the attacks are isolated, as in current firewalls. In the second stage, anomalies in behaviour are detected from the data traffic which did not match any attack signature. As the first stage pertains to basic security policy processing and the machine learning added value lies only in the second stage, only the second stage was implemented.

For actuation, the policy engine was developed to support the functionality to create, modify and delete the following set of rules.

• ACCEPT and forwarding rules to send the data traffic to a “normal” behaviour network function. These rules may be extended to consider the scalability of the system, as they also include forwarding functionality.

• ACCEPT and redirect rules, which are similar to the previous ones in form, however they forward the data traffic to a “quarantine” network function. When an anomaly is detected, the data flow is re-directed by introducing such a rule into the system.

• ACCEPT and redirect rules towards a “Honeynet” function, similar to the quarantine ones, applied when the anomaly repeats itself more times than a threshold while the flow is in “quarantine”.

• DROP – for determined attacks.

The rules are sent to the SDN controller, which is responsible for the distribution of the rules to the firewall components for immediate use within the system.
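A highly simplified sketch of how such a decision could be represented before being handed to the SDN controller is shown below. The rule structure, field names and the quarantine threshold are illustrative assumptions, not the OpenSDNCore or policy engine interface.

def select_action(flow, anomaly_count, quarantine_threshold=3):
    """Map the detection outcome for a flow to one of the four rule types."""
    if flow.get('matched_signature'):
        return {'action': 'DROP', 'match': flow['match']}
    if anomaly_count == 0:
        return {'action': 'ACCEPT', 'forward_to': 'normal-vnf', 'match': flow['match']}
    if anomaly_count < quarantine_threshold:
        return {'action': 'ACCEPT', 'forward_to': 'quarantine-vnf', 'match': flow['match']}
    return {'action': 'ACCEPT', 'forward_to': 'honeynet-vnf', 'match': flow['match']}

# Example: a flow flagged twice as anomalous is redirected to the quarantine zone.
rule = select_action({'match': {'src_ip': '10.0.0.5', 'dst_port': 80}}, anomaly_count=2)
print(rule)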

Actual items implemented in CogNet

For CogNet, the following items were implemented on top of the existing integration between the Fraunhofer OpenSDNCore and OpenStack which was presented in D5.2:

• Extension of the OpenSDNCore Switch (OFS) to monitor data flow related information. On one hand, monitoring is possible on physical network interfaces, inspecting all traffic entering and leaving the configured interface. On the other hand, a flow rule level monitoring capability is introduced to allow the exclusion of certain flows from inspection, for example in the case of traffic redirection. Further details of the monitoring are explained in Section 2.3.2.1.

• A very simple anomaly detection module using state-of-the-art neural networks, as a proof of concept that end-to-end security with machine learning is functionally possible.

• Development of a very simple PoC policy engine able to make decisions and to transmit commands to the OpenSDNCore Controller (OFC). Policy actuation includes the selection of an appropriate policy to enforce as a response to detected threats, anomalies or new traffic patterns, and the translation of the policies into routing or firewall rules for the OpenFlow Switch. Policy actuation is further described in Section 2.3.2.3.

• Integration of the OFC component with the policy engine and the translation of the rules to OpenFlow rules.

Initial Experimentation results

As presented in Section 2.3.2.2 on the detection system, large anomalies were already detected by the system and the system reacted properly to them. However, at the current moment the reaction is parametrized at 4 seconds, which is deemed too long for an attack. A better parametrization is considered for the evaluation work, while at the same time replacing the machine learning part with the anomaly detection presented in Section 3.1.

Experience

Following the procedure described in Section 4.1, the following outlines the identified sources of experience assisting the ML in this use case, together with a ranking of the importance and the difficulty level of these sources.

Monitoring:

• Time – Lack of monitoring data due to difficulty to efficiently extract certain metrics. Importance: 6, Difficulty: 5
• Task – Stale or misconfigured monitoring module. Importance: 8, Difficulty: 3
• Context – Lack of monitoring capacity (compute resources), big amount of stored/sent database data. Importance: 4, Difficulty: 4

Detection:

• Time – High latency of detection due to data polling request latency (several seconds instead of immediate). Importance: 7, Difficulty: 8
• Task – Stale or misconfigured detection module. Importance: 9, Difficulty: 5
• Context – Conflicts with other detection algorithms, lack of detection capacity (low algorithm accuracy). Importance: 8, Difficulty: 4

Actuation:

• Time – Failure to invoke the needed action (policy) in a timely manner (slow invocation, seconds from monitoring to actuation). Importance: 8, Difficulty: 6
• Task – Invocation of the most appropriate action (policy). Importance: 9, Difficulty: 8
• Context – Policy conflicts, lack of policy domains of needed granularity. Importance: 6, Difficulty: 6

Table 2-5 Rank for different sources of experience

Conclusions and possible next steps

With the inclusion of security through the policy engine, firewall and threat detection mechanisms as part of the NFVI, as reported in this section, a set of conclusions can be drawn, mainly on the best practice of implementing an end-to-end prototype.

• Inclusion of security as part of the NFVI functionality: as long as the NFVI has to execute routing and centralized monitoring, it is assumed that, up to a certain scale, security features are feasible. Note that DPI at the infrastructure layer would be too complicated, as it presumes checking all the data packets in detail.

• Centralized decision making: the decision-making firewall, which exposes a set of uniform interfaces coordinated by an SDN controller, provides a means for the system to adapt to topology changes, which include adding new tenants, altering firewalls and retiring compromised VNFs, and to routing changes, which include handling data flows in the appropriate security zones.

• Detecting unknown attacks: by analysing the statistics, the system can learn the normal behaviour of the specific tenant network and predict how it should behave in the future. When the behaviour of the tenant’s network diverges from the normal one, an anomaly is considered, and if its behaviour is not understood, it is considered to be of an unknown attack type. The system gains experience over time and adapts automatically to handle such attacks through external and internal components.

User Manual

The OpenStack installation with OpenSDNCore is as described in D5.2, including the inclusion of the controller as part of OpenStack. The decision making in the centralized policy engine is co-located with the controller and follows the same installation.

2.4. LSP-CLUSTER: Network Traffic Clustering towards Anomaly Detection

Scope

In Deliverable D3.3, we presented LSSVM-Spark 2.0, a general purpose structured prediction framework for doing supervised clustering using Latent SVMstruct19 (LSSVM) [9].

19 http://www.cs.cornell.edu/~cnyu/latentssvm/


Here, we propose LSP-Cluster, a module that demonstrates the benefit of this kind of approach when applied to real network data. Going beyond the network classification scenario (Deliverable D5.2), we used the NetCla20 dataset of labelled network transmissions to check whether it is possible to learn to cluster them. The structured model captures the similarity patterns between pairs of transmissions of the same type and uses them to group together new transmissions, even if they are of a type not present in the training data.

This model is a further step towards anomaly detection due to its ability to distinguish between types of (normal) traffic including those previously unseen. Regarding the latter, it can also detect new types of traffic crossing the data plane for the first time. Any transmission not falling into any traffic cluster detected by the model – a singleton – can be regarded as a potential anomaly.

Note that LSP-Cluster is a stand-alone approach based on supervised clustering and differs from the other anomaly detection approaches devised in CogNet. For example, it is different in nature from the Anomaly Detection Ensemble (ADE) approach (Deliverable D5.2): LSP-Cluster is stand-alone, while ADE is an ensemble that takes the outputs of stand-alone approaches as inputs. LSP-Cluster also differs from the anomaly_detection module (Deliverable D3.3), which learns to predict normal behaviour and classifies as anomalies the cases where the observed behaviour differs significantly from the predicted normal behaviour.

Short description (update of D5.2)

When developing the component we used the ECML-PKDD Network Classification Challenge (NetCla) data for training and testing. Each data point in this dataset is a network transmission generated by a certain type of application. NetCla provides various performance indicators and network parameters for each transmission that can be used as machine learning features.

Testbed Implementation

2.4.3.1 Data Preparation

In the NetCla data, the transmissions are annotated with 19 classes (application types) plus a 0 class. We filter out the data points belonging to the 0 class, as they are generated by a variety of applications with very few data points per application. Then, we organize our training and test data in such a way that the test data contains data points of classes unseen in the training data. For this purpose, we divide the remaining 19 classes into three disjoint sets, C1, C2 and C3. We assemble the train and test sets from the points which belong to C1 ∪ C2 (train set, TR) and to C1 ∪ C3 (test set, TE), respectively. Naturally, each data point belongs either to TR or to TE, i.e. TR ∩ TE = ∅, so TR and TE are disjoint at the data point level. TE therefore also contains data points belonging to classes not present in TR (C3 above), along with data points of classes observed in TR (C1).
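A small sketch of this split, assuming the application class labels are available as an integer array and using an arbitrary assignment of the C1 points to one of the two sides, could be:

import numpy as np

labels = np.loadtxt('netcla_labels.txt', dtype=int)   # one class id per transmission (assumed file)
classes = np.array(sorted(set(labels) - {0}))         # drop the 0 class as described above

# Split the 19 remaining classes into three disjoint sets C1, C2, C3.
rng = np.random.RandomState(0)
rng.shuffle(classes)
c1, c2, c3 = np.array_split(classes, 3)

# Candidate indices by class membership: TR from C1 and C2, TE from C1 and C3.
train_idx = np.where(np.isin(labels, np.concatenate([c1, c2])))[0]
test_idx = np.where(np.isin(labels, np.concatenate([c1, c3])))[0]

# Make the sets disjoint at the data point level: each C1 point goes to one side only.
shared = np.intersect1d(train_idx, test_idx)
train_idx = np.setdiff1d(train_idx, shared[: len(shared) // 2])
test_idx = np.setdiff1d(test_idx, shared[len(shared) // 2:])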

20 http://www.neteye-blog.com/netcla-the-ecml-pkdd-network-classification-challenge/


We split all the data into samples of N = 30 transmissions, in the order they were generated in the network, and, for each transmission u, we form pairwise feature vectors with the N − 1 other transmissions v from the same sample, each pair (u, v) receiving a label, 1 or 0, depending on whether the two transmissions belong to the same class or not.

Each NetCla transmission data point is associated with a number of transmission parameters or features21. We extract the pairwise feature representation for a pair of data points (u, v) as follows: from the pair of i-th features, u_i and v_i, we form a pairwise feature with value

max_i − min_i − |u_i − v_i| ,

where max_i and min_i are, respectively, the maximum and minimum values of the i-th feature over all the data points in the training set. Moreover, similarly to what we did in the Traffic Classification module (Deliverable 5.2, Section 2.4), we discretize continuous-value features using the Multi-interval Supervised Attribute Discretization method [10], using its implementation from the weka22 toolkit.
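A small sketch of this pairwise feature construction (illustrative only, and omitting the discretization step) could look as follows:

import numpy as np

def pairwise_features(u, v, feat_max, feat_min):
    """Similarity-style pairwise features: the value max_i - min_i - |u_i - v_i|
    is large when u_i and v_i are close relative to the range of feature i."""
    return (feat_max - feat_min) - np.abs(u - v)

# feat_max / feat_min are computed once over all the training data points.
train = np.array([[1.0, 10.0], [3.0, 50.0], [2.0, 20.0]])
feat_max, feat_min = train.max(axis=0), train.min(axis=0)

print(pairwise_features(train[0], train[1], feat_max, feat_min))  # [0., 0.] for the farthest pair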

2.4.3.2 Detection – ML Implementation

For each sample above, we form graphs out of the pairwise "links" between transmissions (Figure 2-17). In these graphs each transmission is a node; moreover, we add an artificial node, t0, needed for learning and inference purposes. We connect each transmission node t_i to each of the preceding transmission nodes t_1, …, t_{i−1}. Each edge corresponds to a transmission pair (t_j, t_k), where j < k; it is associated with pairwise features and labelled 0 or 1 (in the training phase) as described in Section 2.4.3.1.

Such graphs are the input examples to the Latent Structured Perceptron (LSP) learning and prediction algorithm [11]. Due to the large scale of the task (we have obtained over 7,000 samples of 30 data points, i.e., over 7,000 graphs), we use here the LSP solver, which is more efficient than Latent SVMstruct (as we showed in [12]).

At the prediction stage, the algorithm computes weights for each edge using the learned model. Then we perform the inference, which consists in finding the maximum spanning trees (Figure 2-18) on the sample graphs. We find such trees by iterating over the transmission nodes t_i and selecting for each of them the edge with the highest score. This way, each transmission node t_i is connected to the preceding node t_j with which it has the highest similarity according to the model, which means it is connected to the cluster to which t_j belongs.
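The inference step described above reduces to a simple greedy selection over the preceding nodes; a Python sketch, assuming the edge scores have already been computed by the learned model, is:

def build_tree(num_nodes, edge_score):
    """For each transmission node i (1..num_nodes), attach it to the preceding
    node j (0..i-1, where 0 is the artificial root t0) with the highest score."""
    parent = {}
    for i in range(1, num_nodes + 1):
        parent[i] = max(range(i), key=lambda j: edge_score(j, i))
    return parent

# Toy example with 3 transmission nodes and a hypothetical scoring function.
scores = {(0, 1): 0.1, (0, 2): 0.2, (1, 2): 0.9, (0, 3): 0.3, (1, 3): 0.2, (2, 3): 0.8}
tree = build_tree(3, lambda j, i: scores[(j, i)])
print(tree)  # {1: 0, 2: 1, 3: 2} -> nodes 1, 2 and 3 end up in the same cluster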

21 See http://www.neteye-blog.com/netcla-the-ecml-pkdd-network-classification-challenge/ for the full list of features 22http://www.cs.waikato.ac.nz/ml/weka/


To evaluate the accuracy of the proposed model, we use simple evaluation statistics:

• Average number of wrong links per tree,

• Percentage of wrongly clustered samples,

• Classification accuracy, if assigning the majority class label to all the data points of a cluster.

Initial Experimentation results

We prepared the train and test data as described in Section 2.4.3.1 and evaluated the performance of the approach using the following evaluation metrics:

• Average number of wrong links per tree: 4.5683

• Percentage of misclassified pairwise clustering decisions: 0.1483

• Classification Macro F1: 0.6966 and Micro F1: 0.7808 (when assigned cluster majority labels)

The results above can be reproduced with train and test command examples from the User Manual in Section 2.4.7.

Experience

This section is not applicable to our particular contribution, as it is more relevant to the contributions with testbeds.

Figure 2-17 Graph of pairwise links between transmission nodes

Figure 2-18 Spanning tree

Conclusions and possible next steps

In the following, we intend to apply the proposed model to real data containing anomalies, if available, or to find a way to simulate anomalous data points. The model is to be further adjusted with regard to how to manage the normal traffic, how to represent anomalies, and how to operate in an online regime. Finally, the model can be suggested for use within the ADE approach as another ensemble component.

User Manual

This is a C/C++ implementation, which relies on the LSSVM implementation, reusing its data types.

2.4.7.1 Download and Installation

Use the following command to clone LSP-CLUSTER from the repository.

$ git clone https://github.com/CogNet-5GPPP/WP5-CSE-Final.git
$ cd ./WP5-CSE-Final/lsp-cluster

I. Download the original LSSVM library

$ wget http://www.cs.cornell.edu/~cnyu/latentssvm/latentnpcoref_v0.12.tar.gz
$ tar xvzf latentnpcoref_v0.12.tar.gz
$ cd latentnpcoref_v0.12

Perform the following modifications:

1. Add the directive #include <cstddef> to DisjointSets.cpp
2. Add the directive #include <assert.h> to np_helper.cc
3. Remove (or comment out) #include "svm_struct_latent_api_types.h" from svm_struct_latent_api.h
4. Add the -w -fPIC flags to CFLAGS in Makefile and in svm_light/Makefile

II. Compile the code

Return to the root folder and build

$ cd ..
$ make clean
$ make

III. Unpack example data

If you plan to train a model from the example data supplied along with the current tool, you need to unpack them first:

$ cd data
$ unzip train.data.zip
$ unzip test.data.zip
$ cd ..

2.4.7.2 Training

The command for training a model has the following form:

Page 48: D5.3 – Engineering Release 2 - CogNet...D5.3 – Engineering Release 2 (Secure NFV Subsystem, High Availability Framework, Degradation Detection and Correction, Autonomic Rules Generator

D5.3 – Network Security and Resilience – Engineering Release 2

CogNet Version 0.2 Page 48 of 90

$ ./sp_clust_learn -c <C_value> -n <n_value> --k <k_value> data_file_mask model_file <number_of_samples> <number_of_features>

Here, the -c parameter sets the regularization parameter C, -n indicates the number of perceptron epochs, and --k the relative cost of a wrong pairwise link (in [0.0...1.0]; default is 1.0), followed by:

• the training data file name mask, where data_file_mask<id> contains training data of a sample having id <id>, where <id> is a number from [1..N]

• model_file – file where the model will be saved

• number of training samples to be used for training

• number of features

To train a model on the example data, you can run:

$ ./sp_clust_learn -c 1000.0 -n 100 --k 0.1 ./data/train-smpl ./data/train-ex3241c1000n100.model 3241 35881

The input training/test data format follows the LSSVM format described in Deliverable 3.2 for LSSVM-Spark. The only distinction is that, here, we add for each pairwise feature vector the class identifiers of the two transmissions, 5 8 at the end of the line below:

-1 912:1.0 1295:1.0 1574:1.0 … 997180:1.0 # 1 1 2 5 8

2.4.7.3 Deployment

The test command has a similar form:

$ ./sp_clust_classify --k <k_value> data_file_mask model_file output_directory_path <number_of_samples>

• Here, the --k parameter is optional; output_directory_path is the path to the directory where the files with the output clustering results are saved.

To test the model trained on the example data above, run:

$ rm ./data/cluster/*
$ rm ./data/majority_class.txt
$ ./sp_clust_classify --k 0.1 ./data/test-smpl ./data/train-ex3241c1000n100.model ./data/cluster 4566

On successful completion of the above command, the majority class labels computed for each cluster will be saved to ./data/majority_class.txt in correspondence to each test transmission, and can be compared against the gold class labels in ./data/test_gold_class.txt using the official NetCla scorer.


3. Network Resilience Testbeds

3.1. Performance degradation (Dense Urban Area Testbed)

Scope

In order to avoid performance degradation leading to downtime of the running services/applications, the system is required to estimate and provide enough resources for them to keep running despite failures and shutdowns of components. However, the servers in existing datacentres are in most cases underutilized because of overprovisioning, which is intended for peak demand periods and for avoiding failures. So, there is a trade-off between keeping the service running despite failures and the resource allocation for the system during its operational periods.

Hence, machine learning techniques can be employed to better understand and analyse the indicators of performance degradation in networks/systems, to detect degradations through anomaly detection techniques and to take corrective actions in advance, reducing their potential damaging effects. Because of this, the overprovisioning can be drastically reduced, while at the same time the system can be tailored to the specific service needs.

The machine learning algorithms and example mitigation actions ultimately address the specifics of the normal behaviour of the system in terms of infrastructure resource consumption, network function resource consumption and their correlation, as well as the subscriber communication. All other behaviour is considered an anomaly and has to be treated with the specific mechanisms.

Testbed Implementation

Most of the testbed implementation components were presented in D5.2. In this section, only updates to that specification are presented, specifically the updates in the machine learning anomaly detection and the exemplary interaction with the network functions in the form of mitigation actions, through this completing the machine learning loop.

Figure 3-1 presents the Open5GCore system, including the exemplary actuation system for carrier-grade software networks. Specifically, for the mitigation of a performance anomaly, the Open5GCore was deployed with overprovisioned MMEs: one MME for the active service and one in hot standby, with a shared memory system between them to be able to replicate state. The load of the MMEs is managed by an MME Load Balancer (MME LB), whose role in this specific testbed deployment is to direct the control plane requests to the active MME. To ensure the high availability of the network, when an anomaly is detected in the active MME, a complex procedure is executed to transfer the load to the hot standby MME and to bring the system back to the active state.

The MME load balancer works at the S1-AP transaction level granularity, being able to split subscriber sessions, but not procedures. This means that during the attachment the subscriber may be served by multiple MMEs which share the subscriber state; however, during a procedure it can be served by only one. Through this implementation, the load balancing is highly granular and enables immediate changing of the serving MME. The shared memory system deployed between the two MMEs is based on Redis.io and includes the complete subscriber state.

For this, a basic configuration management stub was introduced into the system, which through remote command line commands changes the behaviour of the MME-LB to direct the requests to the standby MME in case the monitored one enters an anomaly state.

Additionally, basic orchestration functionality was added, enabling the rebooting of the MME component which behaves in an abnormal manner. This could have been implemented using a comprehensive standard orchestrator such as the Fraunhofer OpenBaton NFV orchestrator; however, this would have involved extended redundant functionality and possible side effects during the evaluation.
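The mitigation sequence triggered by an anomaly can be sketched in Python as follows; the command strings and host names are purely hypothetical placeholders, not the actual MME-LB or orchestration stub CLI, and only illustrate the order of the two actions (redirect, then reboot).

import subprocess

def run_remote(host, command):
    """Run a command on a remote host over SSH (hypothetical hosts and commands)."""
    subprocess.check_call(['ssh', host, command])

def mitigate_mme_anomaly():
    # 1. Tell the MME-LB to direct new control plane requests to the standby MME.
    run_remote('mme-lb', 'switch-active-mme standby')     # placeholder command
    # 2. Ask the orchestration stub to reboot the degraded MME instance.
    run_remote('orchestrator', 'reboot-vnf mme-active')   # placeholder command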

3.1.2.1 Monitoring

The Zabbix monitoring server observes and reports numerous parameters of the system, such as incoming/outgoing traffic through the various network interfaces and the health of the system, including CPU, memory and disk usage metrics. The Zabbix agents are installed in the systems which need to be monitored and send the information to the monitoring server. The monitored metrics offered by Zabbix and the customized metrics of the Open5GCore were presented in D5.2.

Figure 3-1 Testbed showing MME-LB managing multiple MMEs for handling performance degradation


3.1.2.2 Detection – ML implementation

In D5.2, the Anomaly Detection Ensemble (ADE) approach for the early detection of anomalies was discussed. A weighted anomaly window was used as ground truth for training, which prioritized early detection of anomalies. The ADE technique was quite successful and showed improved early detection across all considered datasets. In this deliverable, Long Short Term Memory (LSTM) is used for anomaly detection, which is explained further in Section 3.1.2.2.1.

3.1.2.2.1 Anomaly Detection with LSTM

Anomaly detection refers to the problem of finding patterns in data that do not conform to expected behaviour. These non-conforming patterns are often referred to as anomalies, outliers, discordant observations, exceptions, aberrations, surprises, peculiarities or contaminants in different application domains. This module addresses the problem of anomaly detection by applying a long short term memory (LSTM) network to predict the behaviour of the system after being trained on the normal behavioural data. LSTMs are a type of recurrent neural network that aims to combat the vanishing/exploding gradient problem by introducing gates and an explicitly defined memory cell. Through the use of gates, LSTMs are able to remember or forget information passed through them and hence are well suited to time series data where there are long or unknown gaps between important events. LSTMs have been shown to be able to learn complex sequences, such as writing like Shakespeare or composing primitive music. In our approach, we compare the predictions with the actual values for each point and, when their squared difference is beyond a certain threshold, we signal it as an anomaly. The module employs a dynamic threshold for determining the anomaly, utilizing the squared prediction error between the actual and predicted value. The objectives of this module are threefold:

1) Applying the Anomaly Detection with LSTM module from WP3 to real data collected from the Fraunhofer testbed.

2) Adapting the model (e.g., tuning the optimal number of epochs through early stopping, exploring batch length and look-back) to the normal behaviour data from the testbed.

3) Providing an approach for determining the threshold, irrespective of whether the abnormal data contains labels (i.e., ground truth) or not.

The datasets expected in real-life scenarios are typically not labelled, i.e., do not contain a field that can be 1 or 0 depending on whether the record is an anomaly or not, respectively. This brings a limitation to the set of techniques that can be applied for detecting anomalies. For instance, classifiers generally require labels in order to perform a multi-class classification. Hence, in this module we address this limitation by providing a dynamic approach to calculating the threshold on the squared prediction error without the need of labels.
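A minimal sketch of the prediction part of such a module is given below: an LSTM is trained on windows of the normal metric and asked to forecast the next value, whose squared prediction error is later compared with the dynamic threshold. The window length, layer sizes and file name are illustrative assumptions, not the exact configuration of the module.

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

SEQ_LEN = 50  # look-back window (the module's default sequence length)

def make_windows(series):
    # Build (window, next value) pairs from a 1-D metric series.
    X, y = [], []
    for i in range(len(series) - SEQ_LEN):
        X.append(series[i:i + SEQ_LEN])
        y.append(series[i + SEQ_LEN])
    return np.array(X)[..., np.newaxis], np.array(y)

normal = np.loadtxt("normal_cpu_util.csv")  # placeholder for the normal-behaviour metric
X_train, y_train = make_windows(normal)

model = Sequential()
model.add(LSTM(64, input_shape=(SEQ_LEN, 1)))
model.add(Dense(1))
model.compile(loss="mse", optimizer="adam")
model.fit(X_train, y_train, epochs=150, batch_size=32, verbose=0)

# At run time, forecast the next point; its squared prediction error is then
# compared against the dynamically computed threshold (see the deployment section).
y_pred = model.predict(X_train[:1])[0, 0]
squared_error = (y_pred - y_train[0]) ** 2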


3.1.2.3 Actuation

The current testbed uses a hot standby MME solution as its actuation mechanism. In case an anomaly is detected in the active MME, the control plane requests are forwarded to the standby MME. This is executed through the following steps:

1) The anomaly is detected by the ADE, which transmits a notification to the configuration and orchestration stub functionality.

2) The configuration functionality indicates to the MME-LB to forward all the requests to the hot standby MME.

3) The MME exhibiting the abnormal behaviour is rebooted by the orchestration functionality.

With these simple steps, whenever the MME enters an unexpected state, the system is able to react proactively and remove the possible anomaly, as sketched below.
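The sketch below illustrates, in purely hypothetical form, how such a configuration and orchestration stub could react to an anomaly notification; the host names, the CLI commands and the service name are invented placeholders and do not correspond to the actual Open5GCore interfaces.

import subprocess

MME_LB_HOST = "mme-lb.testbed.local"         # hypothetical management addresses
FAULTY_MME_HOST = "mme-active.testbed.local"

def remote(host, command):
    # Run a remote command-line command on a testbed node (assumes SSH key access).
    return subprocess.run(["ssh", host, command], check=True)

def on_anomaly_detected():
    # 1) redirect the MME-LB to the hot standby MME (hypothetical CLI command)
    remote(MME_LB_HOST, "mme-lb-cli set-active standby")
    # 2) reboot the MME showing abnormal behaviour (hypothetical service name)
    remote(FAULTY_MME_HOST, "sudo systemctl restart open5gcore-mme")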

The actuation presented represents only a proof of concept of how actuation may be realized in a carrier grade system, specifically underlining the advantages of machine learning. Without the ADE providing insight when the active MME starts misbehaving, the current system has to wait until a failure is detected and only then reactively forward the requests to the standby component. Due to the reactive nature of the process, the system will lose a set of sessions, resulting in a worse service towards the subscribers.

Instead, with the anomaly detection, the system is able to adapt when anomalies occur, before the component fails completely. Please note that the redirection of the load balancer with complete session handover is seamless at the level of the longest procedure (an attachment was measured at 40 ms end-to-end) once it is notified; thus the adaptation itself is not as problematic as the detection of the event, which may take 5 or 10 seconds. This is where anomaly detection helps, especially in reducing the interval between the moment the anomaly is detected and the moment the system becomes stable again.

Other actuations could be considered for the other components within the system. However, the hot standby mechanisms function in virtually the same manner, thus not providing any additional insight on how the system would behave better with machine learning. Furthermore, if hot standby components are deployed for two or more of the components in the system, the testbed becomes highly complex, resulting in complex policy-based management. Thus, in order to keep the system simple enough to avoid side effects during the evaluation phase, a single actuation was chosen.

Initial Experimentation results

Figure 3-2 presents the results of the model on the abnormal data with labels. The figure depicts the predicted versus the actual measurements. The plot also depicts the squared error (i.e., the difference between the predicted and actual value, squared) and the Anomaly Label, which represents an indication of the presence of an anomaly in the data. As can be observed, the squared error is much higher when the anomalies actually occur, giving an indication of the presence of the anomaly in the dataset and determining the Anomaly Label to be 1. The first


anomaly occurs during the optimization of the threshold and hence the model does not signal it as an anomaly. As can be observed, all four other anomalies were detected before their occurrence and no false positives were signalled.

Moreover, the model was applied to the abnormal data with sudden spikes of utilization and the results are presented in Figure 3-3. The squared error suggests that the first and last major spikes in the collected abnormal data are anomalies, while the middle spikes are common patterns in the normal data. It can be observed that the model is not sensitive only to spikes, since the middle spikes are not considered anomalies but normal behavioural patterns. We applied the above strategy for determining the threshold and observe the outputs of the anomalies with the Anomaly Label in the lower half of Figure 3-4. Since this dataset is not labelled, the evaluation on this dataset is a visual inspection, where we observe that the model is also able to detect the anomalies before their occurrence; however, the model seems to output a false positive for one of the middle spikes. Depending on the strategy desired (reducing the number of false positives or increasing the number of true anomalies detected), this false positive could be avoided at deployment by requiring a minimum number of signals within a window of values (e.g., 3 signals) before setting the Anomaly Label to 1, as sketched below.
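In the small sketch of this suppression strategy below, the raw per-point detections are only promoted to an Anomaly Label once a minimum number of signals fall within a sliding window of recent points; the window size and the threshold of 3 signals are illustrative values.

from collections import deque

def debounce(raw_signals, window=3, min_signals=3):
    # raw_signals: iterable of 0/1 per-point detections; returns the filtered labels.
    recent = deque(maxlen=window)
    labels = []
    for s in raw_signals:
        recent.append(s)
        labels.append(1 if sum(recent) >= min_signals else 0)
    return labels

# A single isolated signal is ignored, a burst of three consecutive signals is reported.
print(debounce([0, 1, 0, 0, 1, 1, 1, 0]))  # -> [0, 0, 0, 0, 0, 0, 1, 0]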

Figure 3-2 Actual, predicted and squared error on abnormal data shown for mme_system_cpu_util[user] with anomalies represented by dashed vertical line.


Figure 3-4 Anomalies detected by the model on abnormal dataset with sudden spikes for mme_system_cpu_util[user].

Figure 3-3 Actual, predicted and squared error on abnormal data shown for mme_system_cpu_util[user].


Actual items implemented in CogNet

For this demonstrator, a large number of existing components within the Fraunhofer Open5GCore were deployed and tested together, including:

• The MME-LB, which enables load balancing at the S1-AP level between multiple MMEs
• The state sharing between the multiple MMEs, enabling the hot standby
• The remote console configuration for the MME Load Balancer
• The remote console configuration enabling the MME presenting abnormal behaviour to be rebooted

In addition to this, a comprehensive anomaly detection mechanism using machine learning was adapted based on the developments in WP3, considering the time series coming from the live system.

Experience

Following the procedure described in Section 4.1, the table below outlines the identified sources of experience assisting the ML in this use case, together with the ranking of the importance and the difficulty level of these sources.

Phase | Source of experience | Description | Importance | Difficulty
Monitoring | Time | End-to-end monitoring may take some seconds while still being scalable for a virtual network | 9 | 9
Monitoring | Task | Stale or misconfigured monitoring module[s] | 5 | 1
Monitoring | Context | Conflicts with other monitoring tasks | 1 | 5
Detection | Time | A very large number of situations have to be met to be able to gain experience. A system with a reduced number of failures will have too few | 9 | 9
Detection | Task | A large number of false positives will trigger a set of redundant operations | 9 | 6
Detection | Context | Contradictory decisions and policy conflicts | 5 | 1
Actuation | Time | Actuation is running very fast within software networks | 7 | 2
Actuation | Task | The actuation is composed of a set of mitigation actions which may include a large number of components | 7 | 9
Actuation | Context | The actuation process may conflict with other actuation processes running in parallel, resulting in a misconfigured system | 9 | 9

Table 3-1 Rank for different sources of experience


At the current moment, the monitoring system presents large difficulties in the proper time management of the data: too much data still has to be gathered in order to generate the experience, which would overburden a monitoring system that is already at its scalability limit despite all the existing parallelization mechanisms. Independent monitoring per tenant may be the only available alternative.

For the detection, a large number of events has to be gathered to gain the proper insight. However, with multiple successive deployments of the same system across multiple virtual infrastructures (e.g. the same EPC deployed for multiple operators, each with its own infrastructure), the amount of negative experience (i.e. false positive anomaly detections), which is the basis for the anomaly detection, will become large enough.

For the actuation, the most problematic part is the mitigation process in which multiple components are involved, as this presumes the establishment of a proper procedure to handle the communication. At the same time, due to its inherent duration, it may enter into conflict with other mitigation actions for other control entities. This may be handled by a complex policy-based system able to combine the different mitigation processes in order to satisfy the different requests.

Conclusions and possible next steps

In this section the resilience testbed was presented, including the machine learning algorithm and the actuation possibility. In the next stage the testbed will be evaluated to prove the added value of the dynamic adaptation brought by the machine learning.

Furthermore, the data accumulated from the multiple testbeds deployed across the world with similar infrastructure will be used to determine currently unknown anomaly types, resulting in further updates of the software network components themselves, thus using the machine learning insight from the live systems to improve the quality of the implementation itself.

User Manual

3.1.7.1 Download and Installation

o Dependencies
− Python
− Scikit-learn
− Keras deep learning library
• Keras is a high-level neural networks API, written in Python and capable of running on top of either TensorFlow or Theano.

The Anomaly Detection module requires Python 3.4. The module source code and sample data can be cloned from:

git clone https://github.com/CogNet-5GPPP/WP5-CSE-Final.git
cd WP5-CSE-Final/DTADE/

The module dependencies can be installed using the following commands:


source activate python3.4
pip install numpy
pip install pandas
pip install keras
pip install scikit-learn

3.1.7.2 Training

The data utilized for this module can be split into two categories: (i) normal data, measuring the normal behaviour of the system, and (ii) abnormal data, containing abnormal patterns of behaviours, such as sudden spikes or more complex contextual or collective anomalies. Figure 3-5 depicts the normal behaviour for one of the metrics measured by the Monitoring component of the Fraunhofer testbed (i.e., the CPU utilization) after removing the missing values. The figure shows that the normal CPU utilization percentage ranges between 0 and 3.5%.

The abnormal data is depicted in Figure 3-6 and Figure 3-7. Figure 3-6 depicts an abnormal dataset where the time of the anomalies is known. The abnormal data was initially cleaned of missing data. Additionally, the dashed lines represent the instances when the anomalies occur (i.e., the labels)23. The second abnormal dataset, with sudden spikes, has been collected from the testbed and is presented in Figure 3-7. The exact time of the anomalies here is unknown.

23 Note that the labels were not used for training.

Figure 3-5 Normal behaviour for metric mme_system_cpu_util[user] which represents percentage of CPU utilization.


All the data required for the training phase is given in the folder “data”. The data used for training and validating our approach is from the Fraunhofer testbed which provides a set of realistic time-series datasets.

Figure 3-6 Abnormal data with anomalies injected at the dashed lines, at different times as indicated.

Figure 3-7 Abnormal data with sudden spikes of utilization with missing labels from dataset (the time of the anomalies is unknown).


The repository contains the data sets available and can be found at the following path:

WP5-CSE-Final/DTADE/data/

Our module is pre-trained using the normal dataset from the Fraunhofer testbed. In our case, we apply the model on a sample CPU utilization feature from the normal dataset to check the model performance. In the repository, we provide the data to train the model in the following directory.

WP5-CSE-Final/DTADE/normal_data/

We observe the model performance on the abnormal datasets that can be found in the following directory.

WP5-CSE-Final/DTADE/abnormal_data/

Currently the Keras model specifications for the trained LSTM model can be seen here: ./dtade_architecture.json.

3.1.7.3 Deployment

The Python code for training the anomaly detection ensemble model needs to be invoked by

python dynamic_thresh_ade.py

Sequence length can be input as a model parameter. The default value is 50 in the model.

The key execution steps of the model are shown in the console as the model output in Figure 3-8.

A model was trained on the normal data considering the mme_system_cpu_util metric. ModelCheckpoint was used to obtain the model with the minimum loss over 150 epochs. The squared error represents the difference between the predicted and actual value and is further used to determine the presence of an anomaly in the data.

Figure 3-8 Run of DTADE with Sample Data


A threshold is determined for the squared error as follows: for each predicted value, a queue of the prior squared errors is maintained. A scaler is applied to fit the squared errors between 0 and 1. If the squared error of the predicted value is higher than 5 times the standard deviation of the prior scaled squared errors, the module signals the point as anomalous. Hence the squared errors and the threshold are dynamic and generally change at every prediction, to adapt to the new values and increase accuracy. The module is set to wait for a period of 50 timestamps before calculating the standard deviation, in order to reduce the number of false positives. This warm-up period is a tunable parameter; however, we observed that a window of 50 timestamps sufficed for the considered datasets. A condensed sketch of this logic is given below.
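The sketch assumes a min-max rescaling of the error history and a factor of 5 on the standard deviation; variable names and the exact scaling are illustrative, not the module's code.

import numpy as np

WARMUP = 50     # points observed before any anomaly can be signalled
FACTOR = 5.0    # multiplier on the standard deviation of prior scaled errors

def dynamic_threshold_labels(actual, predicted):
    errors, labels = [], []
    for a, p in zip(actual, predicted):
        err = (a - p) ** 2
        if len(errors) >= WARMUP:
            history = np.array(errors, dtype=float)
            span = (history.max() - history.min()) or 1.0      # min-max scale to [0, 1]
            scaled_history = (history - history.min()) / span
            scaled_err = (err - history.min()) / span
            labels.append(1 if scaled_err > FACTOR * scaled_history.std() else 0)
        else:
            labels.append(0)                                   # warm-up period
        errors.append(err)                                     # threshold adapts each step
    return labels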

3.2. SLA

Scope

Service providers (SPs) offer their services with a guarantee on the quality, captured by an SLA. The SLA is a formal contract that specifies, among other things, the constraints on the service level objectives (SLOs) that must be met. The traditional solution for avoiding penalties and respecting the SLA is to overprovision the network with resources. For the SPs, overprovisioning means a waste of resources that could be utilized elsewhere. Furthermore, it can slow down the growth of the SPs by limiting their maximum number of clients, binding them to a strict customer/resources ratio.

In the SLA use case, we approach the SLA compliance problem with a proactive angle. We used ANN (Artificial Neural Networks) to predict the evolution of the loads. Moreover, we have defined a set of SLOs that should be monitored and compared to the predicted evolution of metrics to trigger a proactive alert.

Short description (update of D5.2)

Deliverable D5.2 did not mention the SLA media use case; it instead focused on anomaly detection in an NFV-based environment. The SLA violation study can be considered a specific form of anomaly detection.

Actual items implemented in CogNet

We have implemented an ANN in the CogNet infrastructure to test it in a real-world environment. We have exported the ANN model using the Python pickle data format, as sketched below. More tests are needed to validate and/or update the model to fit the CogNet-specific use case.
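A minimal sketch of such an export with the standard pickle module is given below; the scikit-learn estimator is only a stand-in for the actual trained ANN.

import pickle
from sklearn.neural_network import MLPClassifier

model = MLPClassifier(hidden_layer_sizes=(64, 32))   # stand-in for the trained ANN
# ... model.fit(X_train, y_train) on the labelled vIMS dataset ...

with open("sla_ann_model.pkl", "wb") as f:           # export the model for CogNet
    pickle.dump(model, f)

with open("sla_ann_model.pkl", "rb") as f:           # reload it inside the platform
    restored = pickle.load(f)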


Testbed Implementation

The architecture proposed in Figure 3-9 presents the main steps of the SLA monitoring. We monitored an NFV environment in which we instantiated a virtual IMS. The data collector module collects all the raw data and stores it in a Time Series Database (OpenTSDB). The data pre-processing part prepares the data to feed it to the Cognitive Smart Engine (CSE). The CSE performs two main actions: forecasting and SLO violation prediction. We consider adding a third management action for SLA enforcement in the future. A minimal sketch of reading the stored series back from OpenTSDB is given below.
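The host, metric and tag names in this sketch are placeholders; the /api/query endpoint is OpenTSDB's standard HTTP read interface.

import requests

OPENTSDB = "http://opentsdb.example.org:4242"   # hypothetical address of the TSDB

params = {
    "start": "2017/05/01-00:00:00",
    "end": "2017/06/30-23:59:59",
    # average aggregation of one VNFC metric, e.g. CPU idle percentage of the proxy
    "m": "avg:cpu.idle_perc{vnfc=proxy}",
}
response = requests.get(OPENTSDB + "/api/query", params=params, timeout=30)
response.raise_for_status()
for series in response.json():
    # 'dps' maps UNIX timestamps to values, ready to be loaded into a DataFrame
    print(series["metric"], len(series["dps"]), "points")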

3.2.4.1 Monitoring

Data Collection

o Data Characteristics

The collected data (system level and application level) reflects the behaviour of the vIMS at time t. The received data is stored as time series with a sampling frequency of 30 seconds. The data are indexed by the timestamp and are identified by the VNFC and metric name. The timestamp corresponds to the observation and has an associated value with respect to the metric and VNFC. In the pre-processing phase, the resulting data is a matrix of all the metrics for all VNFCs over the observation window.

Figure 3-9 SLA Architecture

Observation window | 2 months
Number of entry lines per metric | 200,000
Sampling frequency | 30 seconds
Number of raw metrics per VNFC | 30 metrics
Number of features per VNFC | 26 metrics
Number of total features | 26 metrics × 6 VNFCs = 156 features24

Table 3-2 Data summary

Metric Name | Semantics
cpu.idle_perc | Percentage of time the CPU is idle when no I/O requests are in progress
cpu.wait_perc | Percentage of time the CPU is idle AND there is at least one I/O request in progress
cpu.stolen_perc | Percentage of stolen CPU time, i.e. the time spent in other OS contexts when running in a virtualized environment
disk.total_space_mb | The total amount of disk space aggregated across all the disks on a particular node. NOTE: this is an optional metric that is only sent when send_rollup_stats is set to true.
disk.total_used_space_mb | The total amount of used disk space aggregated across all the disks on a particular node. NOTE: this is an optional metric that is only sent when send_rollup_stats is set to true.
io.read_kbytes_sec | Kbytes/sec read by an IO device
io.read_req_sec | Number of read requests/sec to an IO device
load.avg_1_min | The average system load over a 1 minute period
mem.swap_free_perc | Percentage of swap memory that is free
net.in_bytes_sec | Number of network bytes received per second
net.out_bytes_sec | Number of network bytes sent per second
net.in_packets_sec | Number of network packets received per second
net.out_packets_sec | Number of network packets sent per second
net.in_errors_sec | Number of network errors on incoming network traffic per second

Table 3-3 Subset of Monasca monitored metrics

o Data Pre-processing

24 features = metrics


The data preparation phase is crucial for determining the type of data under study, cleaning it of noise and redundancy, and making the data exploitable by the learning algorithms.

• Data cleaning: Time series preprocessing reduces the number of entry lines from more than 400,000 to 200,000. In this phase, the majority of the discarded lines are either redundant entries, non-exploitable errors or missing values.

• Dimension reduction (data segmentation): The dimension reduction in our case refers to reducing the number of entries while keeping the properties of the time series. This technique consists of eliminating the samples in the data that do not entail a change in the form of the curves. It is formally described as follows: given a time series $T_1$ with $m$ data points, generate a new time series $T_2$ with $p$ data points such that $p < m$ and $T_2$ approximates $T_1$. In this phase we reduce the data size from 200,000 to 177,000 entries. A simple illustration is given below.
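This is not the project's exact segmentation method, only an illustration: the sketch reduces the number of entries of a 30-second series by time-based resampling with pandas, which keeps the overall shape of the curve.

import numpy as np
import pandas as pd

idx = pd.date_range("2017-05-01", periods=10000, freq="30S")
ts = pd.Series(np.random.rand(10000), index=idx)   # stand-in for one monitored metric

reduced = ts.resample("60S").mean().dropna()       # fewer points, same overall shape
print(len(ts), "->", len(reduced))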

• Autocorrelation and lagged function: The autocorrelation represents the correlation of a series with itself delayed in time. The autocorrelation allows us to draw the lagged function, which serves as an indicator of the predictability of the metric. For stochastic random noise (i.e. white noise) the autocorrelation is small and constant, which means that it is not predictable. For a cosine function, however, the autocorrelation is very high (near 1) and constant over multiple time steps, which means that, knowing previous values, one can theoretically predict the signal evolution to infinity. In our example, the autocorrelation shows that for most metrics we can predict the one-step-ahead value with more than 80% accuracy, dropping to 60% for the two-steps-ahead value; this is due to the accumulated effect of uncertainty. The insight revealed by the autocorrelation answers the question of whether the data is forecastable or not and to which extent. In our case, the answer is a clear yes for the step-ahead value, as illustrated by the check below.
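The check runs on a synthetic series for illustration: a high autocorrelation at lag 1 indicates good one-step-ahead predictability, and the value decreases for larger lags.

import numpy as np
import pandas as pd

idx = pd.date_range("2017-05-01", periods=5000, freq="30S")
metric = pd.Series(np.sin(np.linspace(0, 200, 5000)) + 0.1 * np.random.randn(5000),
                   index=idx)                        # synthetic stand-in for a metric

for lag in (1, 2, 10):
    print("lag", lag, "autocorrelation", round(metric.autocorr(lag=lag), 3))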

Figure 3-10 Example of time series decomposition


• Stationary data and decomposition: A time series can be decomposed into four sub-time series: one that captures the general trend, the white noise, the absolute signal itself and the seasonality effect, i.e. the repeating cycle, as shown in Figure 3-10.

Decomposing the metrics is essential to understanding the problem before the prediction phase. Each component of the time series can be approached as a separate problem with different algorithms. In this phase, we discarded the general trend of the time series and focused on the remainder of the signals in order to capture the evolution of the metrics.

Moreover, all the metrics that we collect have very different scales, which makes them difficult to compare and can ultimately yield poor performance in the training and forecasting phase. One solution to this problem is to make all the metrics stationary, that is, to rescale them to a common mean of 0 and standard deviation of 1 (shown in Figure 3-11). Another solution for comparing multiple different time series is to divide all the data by their first value to obtain a common starting position of one, as shown in Figure 3-12 d. Both options are sketched below.
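Both rescaling options are shown on synthetic data; the column names are illustrative.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 3) * [1, 100, 5000],
                  columns=["cpu.idle_perc", "net.in_bytes_sec", "disk.total_used_space_mb"])

standardised = (df - df.mean()) / df.std()   # comparable scales: mean 0, std 1
relative = df / df.iloc[0]                   # common starting position of 1 per series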

• Data visualization (correlation, box plot):

Inter-VM correlation (Figure 3-12 b): this draws the correlations between all the features of all the VNFCs. We can rely on these calculations to derive correlations for a specific service, in this case the communication service.

Figure 3-11 Re-scale to make time series stationary: mean to 0 and standard deviation to 1


An example is the correlation between the network_packets_in of the proxy and the cpu_perc of the database.

Intra-VM correlation (Figure 3-12 a): this allows drawing correlations between metrics related to a specific VNFC. Note that the correlation is computed on an observation window without anomalies or SLO violations, so as to exhibit the normal correlations between the variables. Also, the correlation matrix is specific to a given service; in Figure 3-12 b the correlation matrix is tightly coupled to the IMS service. An intuitive example is the correlation between “Network.Packets.In” and “Network.Packets.Out” in the Proxy VNFC. Tracking this correlation can give us an insight or an early alarm whenever a de-correlation happens. For example, if in the proxy the Network.In becomes decorrelated from the Network.Out, that might be a first symptom of an anomaly and worth further investigation. This method has been used by A. Antonescu et al. to define new SLOs, i.e. pairs of correlated metrics. One should keep in mind that correlation does not imply causation; to verify the causal link, more controlled experiments should be performed. A sketch of such a correlation analysis is given below.
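In the sketch, a Pearson correlation matrix is computed over an anomaly-free window and can later be compared against live correlations to spot de-correlation; the column names and the synthetic relations are assumptions for illustration only.

import numpy as np
import pandas as pd

packets_in = np.random.poisson(1000, 5000).astype(float)
df = pd.DataFrame({
    "proxy.net.in_packets_sec": packets_in,
    "proxy.net.out_packets_sec": packets_in * 0.98 + np.random.randn(5000),
    "db.cpu.idle_perc": 100.0 - packets_in / 50 + np.random.randn(5000),
})

baseline_corr = df.corr()        # learned on a window without SLO violations
print(baseline_corr.round(2))
# At run time, a large drop of corr(proxy in, proxy out) compared to this baseline
# would be an early symptom worth further investigation.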

• Data Labelling: Machine learning approaches are about generating models or programs based on input data. For the sake of simplicity, we can distinguish two broad input structures: unlabelled data, for example entering only the metric values, a type called unsupervised learning and mainly concerned with clustering, and supervised learning, whereby the inputs are labelled with a specific class.

Figure 3-12 (a) Intra VNFC correlations. (b) Inter VNFC correlations. (c) Box plot of the Bono VNFC system-level metrics rescaled between 0 and 10. The blue box marks the quartiles of the distribution; the red line is the median and the crosses show the outliers. (d) Dividing time series by their first value to track the evolution.


We formalized our approach as a supervised machine learning technique: with each entry line, composed of all the metrics, we associate a binary value that labels the line (i.e. the set of metric values).

We used the mathematical formalization of the SLA and SLO from Section 3.2.4.2, which describes the SLA violation as at least one SLO violation (SLOV), and the functions $F_1$ and $F_2$ that map low-level metrics to the SLOs, as shown in the next section. The functions can be defined empirically or found using machine learning techniques that map the observed SLOs to the variation of the metrics. Labelling each row with a binary variable results in a vector Y (Equation 1) with values {0, 1} that maps the whole input matrix X of 156 features (Equation 2).

$Y = \left(Y_1, Y_2, Y_3, \dots, Y_n\right)^{T}$

Equation 1 Vector of target values

$X = \begin{pmatrix} m_{1,1} & m_{1,2} & m_{1,3} & \dots & m_{1,156} \\ m_{2,1} & m_{2,2} & m_{2,3} & \dots & m_{2,156} \\ m_{3,1} & m_{3,2} & m_{3,3} & \dots & m_{3,156} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ m_{n,1} & m_{n,2} & m_{n,3} & \dots & m_{n,156} \end{pmatrix}$

Equation 2 Input matrix structure

This method allows the ANN to learn how to classify the data into two SLA violation states; however, this data labelling structure is not enough for the ANN to learn forecasting and thus to correctly classify the SLA violation before it occurs. For the ANN to learn simultaneously to classify and forecast, we lagged the Y vector by one step, so that the rows of the matrix X become labelled with the label of the next row. The number of steps lagged depends on the results of the autocorrelation, which is the reason why we selected a lag of one step. A small sketch of this labelling is given below.
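In the sketch, each row of the feature matrix X receives the SLA-violation state of the next timestamp, so the network learns to classify one step ahead; the data is synthetic and the shapes only mirror the 156-feature matrix described above.

import numpy as np
import pandas as pd

n = 1000
X = pd.DataFrame(np.random.rand(n, 156))        # 156 features = 26 metrics x 6 VNFCs
y = pd.Series(np.random.binomial(1, 0.05, n))   # 1 = SLA violation at this timestamp

y_next = y.shift(-1)                            # label each row with the next row's state
X_train = X.iloc[:-1]                           # the last row has no next-step label
y_train = y_next.iloc[:-1].astype(int)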

3.2.4.2 Detection – ML implementation

In Deliverable D4.2, the contributions to SLA Management in the NFV context are:

• A novel cognitive SLA Management Framework for Programmable Networks (SDN/NFV),


• A comparative study of two ANN architectures and how we can leverage them, with deep learning, for the SLA compliance problem in a proactive fashion.

The SLO is the basic measurable constraint used to assess the compliance of the network state with the SLA agreement. In this deliverable we used minimum throughput and response time as example SLOs upon which to base the service quality:

Throughput: the throughput of transactions, e.g. accessing databases or sending/receiving traffic, should be more than 200 kbps for at least 70% of the traffic, averaged over 10 minutes.

Response Time: the response time should be less than 1 second for 80% of requests, measured at 30-minute intervals.

Let $M_i$ be the observable variables for $VM_i$: $M_i = \{m_{i1}, m_{i2}, m_{i3}, \dots, m_{in}\}$

and let $\vec{VM} = \{VM_1, VM_2, VM_3, \dots, VM_m\}$ be the set of virtual machines.

The SLO space is $S_i = \{S^{+}, S^{-}\}$ for $n$ SLOs,

$V = \sum_{i=0}^{n} \gamma(S_i)$, with $S_i = \langle \vec{VM}, SLO_i \rangle$ and $\exists \vec{M^{*}} \subseteq \vec{M}$ such that $\max_{(k, \vec{M^{*}})} \left[ corr\!\left(\vec{M^{*}}, SLOV_i\right) - \alpha \right]$, where $k$ is the number of elements of $\vec{M^{*}}$, $k \le m$, and $\alpha$ is the correlation threshold.

3.2.4.3 Actuation

Ponder2 specification


The Ponder2 framework25 (Figure 3-13) and its components are listed in Table 3-4.

Input | Capability | Output
Alarm (java, XML), supervision actions | Discovery | *Policy Event that corresponds to the alarm captured by Wireshark. *Policy Action: selected from the list of supervision actions
Set of policies {p1, p2, ..., pN} | Conflicts resolution | One policy with high priority
Selected policy | PEP | Policy Action

Table 3-4 Components of Ponder2

Step 1: By using the DSL, our pre-defined policy (.p2 file) can be created and saved in the corresponding directory.

Step 2: Wireshark detects the alarms from the OpenIMS platform by monitoring the network traffic (SIP traffic).

25 http://ponder2.net/

Figure 3-13 Ponder2 framework


Step 3: Wireshark passes the detected alarm (.xml file) as an event to the Ponder2 framework.

Step 4: Ponder2 discovers this event by one of two methods: RMI together with xmlBlaster, or a Web Service. In our case, Ponder2 has an XML managed object which holds XML and can apply XPath queries on it. The XML is a Ponder2 basic managed object and therefore does not need to be loaded. An XML managed object can be created from a String by sending it the "asXML" unary message. Hence, we create our XML event and the configuration XML file of OpenIMS directly within Ponder2.

Step 5: When Ponder2 receives the event from Wireshark, the pre-defined policy will be triggered if its conditions are satisfied. With the help of Ponder2's enforcement point, the actions will be performed on the corresponding Managed Object (.xml file).

Step 6: The OpenIMS will be restarted and reconfigured. The alarm will disappear when the reconfiguration is finished.

Note: RMI is Java Remote Method Invocation. XmlBlaster is a publish/subscribe and point-to-point, 100% Java based MOM (message-oriented middleware) server which exchanges messages between publishers and subscribers. The messages are described with XML-encoded meta information. Messages may contain anything: GIF images, Java objects, Python scripts, XML data, a Word document, plain text.

• Communication with the server is based on socket, CORBA (using JacORB), RMI, XmlRpc, HTTP or email; clients are free to choose their preferred protocol. Other protocols like SOAP may be plugged in.

• Subscribers can use XPath expressions to filter the messages they wish to receive.

Figure 3-14 Relationship between RMI and XML Blaster


Wireshark is a free and open-source packet analyser used for network troubleshooting and analysis. We use Wireshark to monitor the traffic of the OpenIMS and save it locally. When it detects an issue, alarms will occur and it will change the attributes of the Structure DSL, which will generate Ponder2 events. When Ponder2 receives events, it will evaluate the conditions of the corresponding policy and enforce the policy whose conditions are satisfied. The actions include changing the attributes of the Structure DSL as well as other pre-defined supervision actions.

Initial Experimentation results

The initial experimentation results concern training an LSTM to forecast the evolution of the metrics (Figure 3-15, Table 3-5) and a classification part based on a decision tree (Table 3-6).

The initial results are based on a 4 Gbits data set representative of the vIMS use case.

The evaluation of the proposed framework is performed in two distinct phases: (1) evaluation of the forecasting accuracy, and (2) evaluation of the SLOV identification. Each phase is mapped to an evaluation metric. In (1), the evaluation is based on the difference between the predicted value $v$ at time $t$ and its expected value at the same time. The lower this value, the better the result. In the literature, this error metric (Equation 3) is known as RMSE (Root Mean Square Error). RMSE is widely adopted to evaluate the accuracy of ANNs; we adopted it to compare the different LSTM architectures.

$RMSE = \sqrt{\dfrac{\sum_{t=1}^{n}\left(h(t) - y(t)\right)^{2}}{n}}$

Equation 3 RMSE - Root Mean Square Error
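A short numerical check of this definition, with h(t) the forecast and y(t) the observed value, is shown below on illustrative values.

import numpy as np

def rmse(h, y):
    h, y = np.asarray(h, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.mean((h - y) ** 2))

print(rmse([3.0, 4.5, 5.0], [3.2, 4.0, 5.5]))   # RMSE over three forecast/actual pairs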

Before evaluating phase (2), we labelled the training dataset according to the three SLO classes. This dataset serves as the training data for the decision tree algorithm. After learning the model, i.e. the graph, we evaluate it using the precision and recall metrics. These metrics are essential when working with skewed classes (i.e. when the number of instances of one class, no SLO breach, largely outnumbers the other classes, SLO breach).

The precision metric evaluates the accuracy of our predictions in correctly detecting SLO breaches. It is given by Equation 4, where true positives are the correctly identified SLO breaches and false positives are the false/incorrect SLO breaches identified by the system. The higher the precision, the better the result. On the other hand, recall answers the question: of all SLO breaches, what fraction was correctly detected by the algorithm? Recall is given by the second formula in Equation 4, where false negatives are the SLO breaches left undetected by our framework.

$Precision = \dfrac{True\ positive}{True\ positive + False\ positive}$

$Recall = \dfrac{True\ positive}{True\ positive + False\ negative}$

$F\text{-}score = \dfrac{2 \cdot Recall \cdot Precision}{Recall + Precision}$

Equation 4 Precision, Recall, F-Score equations

Additionally, in order to have a single number to evaluate our framework, we used the F-score metric. The F-score is a robust way to generate a single-value metric from the two variables.

The whole evaluation of our framework was carried out in offline mode, i.e. using data gathered over a period of eight weeks, with known and predefined SLO breaches (using fault injection techniques). The dataset was divided into two subsets, namely a training set of approximately 70% and a testing set of 30%. The dataset comprises three anomaly classes, i.e. three different SLO breaches. The results are depicted in Table 3-6, and a sketch of the evaluation step is given below.
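The sketch reproduces the evaluation step on synthetic data: a 70/30 split, a scikit-learn decision tree on the labelled rows, and precision, recall and F-score on the held-out set; the actual experiments used the labelled vIMS dataset with three SLO-breach classes.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score, f1_score

X = np.random.rand(5000, 156)
y = np.random.binomial(1, 0.1, 5000)            # 1 = SLO violation

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier().fit(X_tr, y_tr)
pred = clf.predict(X_te)

print("precision", precision_score(y_te, pred),
      "recall", recall_score(y_te, pred),
      "f-score", f1_score(y_te, pred))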

In Table 3-5, a batch size of one yields the best results, ahead of batch sizes of 10 and 50.

Overall, the ANN training is time consuming and is highly correlated with the architecture, especially the number of cells in the hidden layer. Moreover, we noticed that after a certain RMSE value, adding LSTM cells does not significantly improve the forecasting accuracy and can backfire into overfitting, i.e. difficulty in generalizing to unseen data.

Table 3-6 shows the promising performance of the ID3 algorithm when applied to our prediction problem. The minimum accuracy (F-score) found is 0.843, which corresponds to binary classes (either SLOV or not). We progressively added SLOs to measure how the MCDT would handle different classes. The results show that the accuracy improves with additional SLO classes. This can be explained by the variation of the precision and recall metrics with respect to each new SLO class. Overall, the MCDT performs well on the testing set; however, more thorough testing is necessary over a longer period of time and with more SLOs.

Figure 3-15 The incoming signal, training data and test set are indicated by blue, green and red colours respectively

Batch Size | Epoch | Layers | Training Time | RMSE
1 | 5 | 1 | 2min 12sec | 3.95
1 | 60 | 4 | 22min 54sec | 3.82
1 | 500 | 4 | 3h 19min 12sec | 3.58
10 | 5 | 1 | 2min 26sec | 10.62
10 | 60 | 4 | 2min 12sec | 11.51
10 | 500 | 4 | 16min 43sec | 17.38
50 | 5 | 1 | 37sec | 12.01
50 | 60 | 4 | 1min 3sec | 11.50
50 | 500 | 4 | 4min 10sec | 11.48

Table 3-5 Comparison of results yielded from different LSTM architectures

SLO Classes | Number of SLOs | Precision | Recall | F-score
1 | 60 | 0.822 | 0.879 | 0.849
2 | 180 | 0.890 | 0.917 | 0.903
3 | 260 | 0.909 | 0.913 | 0.910

Table 3-6 MCDT Evaluation of the Test Set

Experience

Following the procedure described in Section 4.1, the table below outlines the identified sources of experience assisting the ML in this use case, together with the ranking of the importance and the difficulty level of these sources.

Phase | Source of experience | Description | Importance | Difficulty
Monitoring | Time | Lack or latency of monitoring data or wrong time granularity of monitoring data | 5 | 6
Monitoring | Task | Stale or misconfigured monitoring module[s] | 6 | 6
Monitoring | Context | Conflicts with other monitoring tasks, lack of monitoring capacity | 5 | 6
Detection | Time | High latency of detection e.g. due to wrongly selected detection algorithm | 8 | 7
Detection | Task | Stale or misconfigured detection module[s] | 7 | 8
Detection | Context | Conflicts with other detection algorithms, lack of detection capacity (e.g. compute resources) | 8 | 7
Actuation | Time | Failure to invoke needed action (policy) in a timely manner (late invocation and/or too early invocation) | 7 | 9
Actuation | Task | Invocation of a wrong action (policy) | 8 | 8
Actuation | Context | Conflicts with other actions, e.g. policy conflicts; lack of policy domains of needed granularity | 8 | 7

Table 3-7 Rank for different sources of experience

Conclusions and possible next steps

As a next step of the Media SLA use case, we foresee the use of reinforcement learning techniques to update the ML model over time. Another possibility is to use a regular learning technique and update the model automatically by constantly training on new datasets.

User Manual

3.2.8.1 Traffic Generation

We generated SIP traffic to emulate SLO violations. The stress test is a VM that establishes multiple calls against the Sprout node.

The script that runs on the stress VM emulates a P-CSCF and sends the traffic to Sprout26, stress testing the IMS Core directly.

The logs can be found in /var/log/clearwater-sip-stress and also contain a human-readable summary of the stress run (with information about the percentage of failed calls, average latency, and so on).

Manual Install:

• Create a new VM and configure it to access the project Clearwater repository.
• Set the following property in /etc/clearwater/local_config:
  local_ip - the local IP address of this node

26 https://github.com/Metaswitch/sprout


• Run sudo apt-get install clearwater-sip-stress-coreonly to install the Debian packages.
• You need to open the following TCP ports: 5052, 5054 and 5082.
• To start the test run: /usr/share/clearwater/bin/run_stress <home_domain> <number of subscribers> <duration in minutes>

3.2.8.2 Client configuration

Click on Signup to create a number for your client as shown in Figure 3-17:

Figure 3-16 Clearwater login dashboard

Figure 3-17 Clearwater account configuration

Figure 3-18 Creating subscribers


Use the given number to register the client in any softphone software, e.g., Zoiper, X-Lite, or Bria, and make a call between two SIP agents.

To run Clearwater live tests

Log onto any node in the deployment and run the commands below:

curl -sSL https://get.docker.com/ | sh
docker pull opnfv/functest
docker run --dns=84.39.33.82 -it opnfv/functest /bin/bash
cd ~/repos/vims-test
source /etc/profile.d/rvm.sh
rake test[reflexion.nccc] SIGNUP_CODE=secret

3.2.8.3 Monitoring configuration (Monasca)

The installation of Monasca monitoring service is complex and demands the installation of multiple packages:

Figure 3-19 Configuration Android SIP client


apt-get install -y git
apt-get install openjdk-7-jre-headless python-pip python-dev

Install mysql database

apt-get install -y mysql-server

Install Zookeeper

apt-get install -y zookeeper zookeeperd zookeeper-bin
service zookeeper restart

Install and configure kafka

wget http://apache.mirrors.tds.net/kafka/0.8.1.1/kafka_2.9.2-0.8.1.1.tgz
mv kafka_2.9.2-0.8.1.1.tgz /opt
cd /opt
tar zxf kafka_2.9.2-0.8.1.1.tgz
ln -s /opt/kafka_2.9.2-0.8.1.1/ /opt/kafka
ln -s /opt/kafka/config /etc/kafka

Create kafka startup scripts in /etc/init/kafka.conf

description "Kafka" start on runlevel [2345] stop on runlevel [!2345] respawn limit nofile 32768 32768 # If zookeeper is running on this box also give it time to start up properly pre-start script if [ -e /etc/init.d/zookeeper ]; then /etc/init.d/zookeeper restart fi end script # Rather than using setuid/setgid sudo is used because the pre-start task must run as root exec sudo -Hu kafka -g kafka KAFKA_HEAP_OPTS="-Xmx1G -Xms1G" JMX_PORT=9997 /opt/kafka/bin/kafka-server-start.sh /etc/kafka/server.properties

Create Kafka topics

/opt/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 64 --topic metrics
/opt/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 12 --topic events


/opt/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 12 --topic raw-events
/opt/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 12 --topic transformed-events
/opt/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 12 --topic stream-definitions
/opt/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 12 --topic transform-definitions
/opt/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 12 --topic alarm-state-transitions
/opt/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 12 --topic alarm-notifications
/opt/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 12 --topic stream-notifications
/opt/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 3 --topic retry-notifications

Install and configure influxdb

curl -sL https://repos.influxdata.com/influxdb.key | apt-key add -
echo "deb https://repos.influxdata.com/ubuntu trusty stable" > /etc/apt/sources.list.d/influxdb.list
apt-get update
apt-get install -y apt-transport-https
apt-get install -y influxdb
service influxdb start

Install Monasca API packages

pip install monasca-common
pip install gunicorn
pip install greenlet  # Required for both
pip install eventlet  # For eventlet workers
pip install gevent    # For gevent workers
pip install monasca-api
pip install influxdb

On the OpenStack controller node, create the Monasca user and password, and assign the admin role to the monasca user in the service tenant.

openstack user create --domain default --password qydcos monasca
openstack role add --project service --user monasca admin
openstack service create --name monasca --description "Monasca monitoring service" monitoring
openstack endpoint create --region RegionOne monasca public http://192.168.1.143:8082/v2.0
openstack endpoint create --region RegionOne monasca internal http://192.168.1.143:8082/v2.0


openstack endpoint create --region RegionOne monasca admin http://192.168.1.143:8082/v2.0

Note that 192.168.1.143 is the floating IP address of the API VM; change it to match your deployment.

Create Monasca API startup scripts:

vim /etc/init/monasca-api.conf

# Startup script for the Monasca API
description "Monasca API Python app"
start on runlevel [2345]
console log
respawn
setgid monasca
setuid monasca
exec /usr/local/bin/gunicorn -n monasca-api -k eventlet --worker-connections=2000 --backlog=1000 --paste /etc/monasca/api-config.ini

Launch Monasca with the following command:

monasca-setup -u monasca -p qydcos --user_domain_id e25e0413a70c41449d2ccc2578deb1e4 --project_domain_id e25e0413a70c41449d2ccc2578deb1e4 --user monasca \ --project_name service -s monitoring --keystone_url http://192.168.1.11:35357/v3 --monasca_url http://192.168.1.143:8082/v2.0 --config_dir /etc/monasca/agent --log_dir /var/log/monasca/agent --overwrite


4. Types of Experience

4.1. Introduction

In addition to the functional components (Monitoring, Detection, Actuation) described in detail within the sections on the respective testbeds, any ML-based decision inherently depends on a fourth, non-functional component, which is Experience. The need for this non-functional component is clear from a relatively old, yet frequently quoted, formal definition of ML:

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P,

improves with experience E."27.

In the above definition, T and P are conventionally, in network management, considered to be a set of management tasks and a set of KPIs. It is important to keep in mind that, in practice, both sets consist of items (tasks and KPIs, respectively) that are not orthogonal, due to multiple, partially historical28, reasons.

In order to have a rough idea of the (software) experience concept, let us consider, first, how the experience can be implemented and, second, what the sources are from which the software gains its experience. It appears almost natural to consider for the former a database with records annotated by management tasks, while for the latter it appears useful to "learn from failures". Without pretending to be able to address all possible software failures, let us nevertheless consider failures as experience sources during the three functional phases, Monitoring, Detection and Actuation, each of which can be due to certain errors in time, (management) task, and (management) context.

The above gives us a certain software experience structure consisting of nine interdependent records of our possible experience database. In Table 4-1, descriptions and examples clarify the possible sources of experience.

Phase | Source of experience | Description (example) | Importance | Difficulty
Monitoring | Time | Lack or latency of monitoring data or wrong time granularity of monitoring data | 0..9 | 0..9
Monitoring | Task | Stale or misconfigured monitoring module[s] | 0..9 | 0..9
Monitoring | Context | Conflicts with other monitoring tasks, lack of monitoring capacity | 0..9 | 0..9
Detection | Time | High latency of detection e.g. due to wrongly selected detection algorithm | 0..9 | 0..9
Detection | Task | Stale or misconfigured detection module[s] | 0..9 | 0..9
Detection | Context | Conflicts with other detection algorithms, lack of detection capacity (e.g. compute resources) | 0..9 | 0..9
Actuation | Time | Failure to invoke needed action (policy) in timely manner (late invocation and/or too early invocation) | 0..9 | 0..9
Actuation | Task | Invocation of a wrong action (policy) | 0..9 | 0..9
Actuation | Context | Conflicts with other actions, e.g. policy conflicts; lack of policy domains of needed granularity | 0..9 | 0..9

Table 4-1 Experience template

27 Mitchell, T. (1997). Machine Learning. McGraw Hill. p. 2. ISBN 0-07-042807-7
28 For example, a set of KPIs deployed by a particular network operator depends on the evolution of its service portfolio.

4.2. Per use-case types of experience

The exercise of completing the experience template was given to the main developers of each of the specific testbeds, to gain insight from their expert perspective while implementing the machine learning solutions, concentrating on the importance and the difficulty of implementation from their current perspective. As only a limited number of experts completed this template, the results may vary considerably from the results of a comprehensive questionnaire, which would give a proper mirror of how experience is seen by the networking community.

In the following, the experience tables coming from each of the use cases are grouped and evaluated together, and several conclusions are drawn regarding the aspects on which the next evaluation phase of the presented machine learning testbeds will concentrate.

First, we observe that almost all use case owners agreed on the suggested taxonomy of experience sources, i.e. on the template given in Table 4-1. However, since the descriptions were partially modified to tailor them to use-case-specific features, we list these modifications without numerical evaluation but with the following use case labels:

1. DSE - Distributed Security Enablement,

2. HN - Honeynet,

3. SAD - NFV Security Anomaly Detection,

4. NR - Network Resilience,

5. SLA - Service Level Agreements.


The use case interpretations of experience are grouped into the three functional phases, which demonstrate both the consensus on certain experience descriptions and the differences in experience interpretation per use case.

Monitoring

Source of experience | Use case description | Use case
Time | Lack of sufficient monitoring data due to incorrect or insufficient sampling frequency, e.g. sFlow | DSE
Time | Lack of monitoring data or wrong time granularity of monitoring data, e.g. NetFlow sampling; difficulty example: detecting a failure in the data source | HN
Time | Lack of monitoring data due to the difficulty of efficiently extracting certain metrics | SAD
Time | End-to-end monitoring may take some seconds while still being scalable for a virtual network; difficulty example: detecting a failure in the data source | NR
Time | Lack or latency of monitoring data or wrong time granularity of monitoring data | SLA
Task | Stale or misconfigured monitoring module[s] | DSE, HN, SAD, NR, SLA
Context | Lack of monitoring capacity, resource limitations or external monitoring conflicts | DSE
Context | Conflicts with other monitoring tasks | HN, NR
Context | Lack of monitoring capacity (compute resources), large amount of stored/sent database data | SAD
Context | Conflicts with other monitoring tasks, lack of monitoring capacity | SLA

Table 4-2 Use case based interpretation of the different sources of experience on Monitoring

Detection

Source of experience | Use case description | Use case
Time | High latency of detection due to dynamic traffic patterns, increased metrics, incorrect algorithms | DSE
Time | High latency of detection caused by the lifetime of flows (which can be several hours) | HN
Time | High latency of detection due to data polling request latency (several seconds instead of immediate) | SAD
Time | A very large number of situations have to be met to be able to gain experience; a system with a reduced number of failures will have too few such situations | NR
Time | High latency of detection, e.g. due to a wrongly selected detection algorithm | SLA
Task | Stale or misconfigured detection module[s] | DSE, HN, SAD, SLA
Task | A large number of false positives will trigger a set of redundant operations | NR
Context | Conflicts with other detection algorithms, cross talk or cross contamination of the monitored infrastructure | DSE
Context | Conflicts with other detection algorithms, lack of detection capacity (no accurate algorithm) | HN
Context | Conflicts with other detection algorithms, lack of detection capacity (low algorithm accuracy) | SAD
Context | Contradictory decisions and policy conflicts | NR
Context | Conflicts with other detection algorithms, lack of detection capacity (e.g. compute resources) | SLA

Table 4-3 Use case based interpretation of the different sources of experience on Detection

Actuation

Source of experience | Use case description | Use case
Time | Failure to invoke the needed action (policy) in a timely manner (slow invocation, seconds from monitoring to actuation) | DSE, SAD
Time | Actuation is running very fast within software networks | NR
Time | Failure to invoke the needed action (policy) in a timely manner (late invocation and/or too early invocation) | SLA
Task | Invocation of the most appropriate action (policy) | DSE, SAD
Task | The actuation is composed of a set of mitigation actions which may include a large number of components | NR
Task | Invocation of a wrong action (policy) | SLA
Context | Policy conflicts, multiple actuations on the same infrastructure with opposing goals | DSE
Context | Policy conflicts, lack of policy domains of the needed granularity | SAD, SLA
Context | The actuation process may conflict with other actuation processes running in parallel, resulting in a misconfigured system | NR

Table 4-4 Use case based interpretation of the different sources of experience on Actuation

Bearing in mind the above, sometimes tiny, differences in the interpretation of experience per use case, we now compare the impact of each type of experience per functional phase (Monitoring, Detection, Actuation), further divided into the three facets (Time, Task, Context), across all considered use cases. The impact computation is based on the experts' evaluation of the Importance and the Difficulty of a particular source of experience and is given by

Impact = Importance × Difficulty
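As a small illustration of this computation (with hypothetical ratings, not the actual expert scores behind Figures 4-1 to 4-3), the per-use-case impact and the mean impact reported as the last column of the charts below could be derived as follows; the mean impact is the product of the average importance and the average difficulty across all use cases.

```python
from statistics import mean

# Hypothetical (importance, difficulty) ratings on the 0..9 scale for one
# source of experience (e.g. Monitoring/Time), keyed by use case label.
ratings = {
    "DSE": (7, 5),
    "HN":  (6, 4),
    "SAD": (8, 6),
    "NR":  (5, 3),
    "SLA": (7, 7),
}

# Per-use-case impact: Importance * Difficulty
impact = {uc: imp * dif for uc, (imp, dif) in ratings.items()}
# {'DSE': 35, 'HN': 24, 'SAD': 48, 'NR': 15, 'SLA': 49}

# Mean impact: average importance multiplied by average difficulty
avg_importance = mean(imp for imp, _ in ratings.values())   # 6.6
avg_difficulty = mean(dif for _, dif in ratings.values())    # 5.0
mean_impact = avg_importance * avg_difficulty                # 33.0

print(impact, mean_impact)
```

Note that the mean impact defined this way generally differs from the mean of the per-use-case impacts (34.2 for the hypothetical values above), which is why the charts report the former explicitly.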


Monitoring

The following three charts compare the impact of Time, Task and Context as sources of ML experience during the monitoring phase across the five use cases; the last column is the mean impact value, computed as the product of the average importance and the average difficulty over all use cases.

Figure 4-1 Comparison based on impact of Time, Task and Context as sources of ML experience during Monitoring


Detection

The following three charts compare the impact of Time, Task and Context as sources of ML experience during the detection phase across the five use cases; the last column is the mean impact value, computed as the product of the average importance and the average difficulty over all use cases.

Figure 4-2 Comparison based on impact of Time, Task and Context as sources of ML experience during Detection


Actuation

The following three charts compare the impact of Time, Task and Context as sources of ML experience during the actuation phase across the five use cases; the last column is the mean impact value, computed as the product of the average importance and the average difficulty over all use cases. Note that for the Honeynet use case the actuation phase was not considered a relevant source of experience.

Figure 4-3 Comparison based on impact of Time, Task and Context as sources of ML experience during Actuation


4.3. Executive summary and Conclusions

We introduce the concept of experience - a metric through which the evaluation of ML-assisted network management can be performed. We attempt to explain the importance of a formal, task-specific and performance-specific definition of experience; additionally, we attempt to justify the key components of experience and put them into a template.

Further, the testbed (use case) owners filled in the template as they deemed adequate for their particular use case. This exercise was important because ML-assisted network management is a novel field: contrary to, say, ML-based image recognition, CogNet must recognise the network situation (attack, anomaly, traffic pattern and the like) in which a particular action is required. In CogNet, this situation is not "given" a priori like the source data in an image recognition task; hence monitoring, for example, must be instructed on what and how to monitor. Similar comments are valid for the detection and actuation processes. We use our different testbeds and their respective management tasks to cover the field of experience through the expert opinions of the testbed owners; therefore the comments, additions, doubts and counter-arguments made by the experts are extremely important for our future work.

Finally, we try to generalise our experience in an attempt to define the CogNet experience concept and to make pragmatic recommendations for future work. In particular, thinking beyond the CogNet project, we try to see the value of the experience concept within a product backlog29 and to identify the respective usage patterns in product design.

The comparison of the numerical values of experience impact in the different phases of ML-based network management demonstrates that our use cases are sufficiently heterogeneous from the experience viewpoint. It remains to be seen, however, whether the use case owners, after studying the results of the above evaluation and use-case comparison, will adjust their evaluations of Importance and Difficulty, as is usual for all types of Delphi analysis.

29 This is the key concept of SCRUM


5. Conclusions and Further Work

This deliverable continues the description of the developments started in D5.2, in regard to the specific management systems where machine learning techniques can be applied and provide positive results, as well as the development of the machine learning techniques based on the data acquired from the different testbeds and the proof-of-concept mitigation actions, providing best-practice guidance on how the system should use the machine learning insight.

The systems were developed on an end-to-end basis, including the further refinement of the monitoring solution, the further adaptation of the machine learning mechanisms based on the developments of WP3, and the development of exemplary actuations. As the number of possible actions is very large and the effect of multiple simultaneous actions across a system was not studied, an approach was taken in which only some significant actions are chosen.

The testbeds represent mainly state-of-the-art deployments for SDN/NFV environments, including Service Function Chaining, SDN embedded within data center networks, a software implementation of the packet core and a software IMS implementation. With this, a very large coverage of the management infrastructure of 5G networks was obtained, aiming to prove that resilience and security can be addressed with machine learning across the different domains.

The testbeds presented here match the different security and reliability issues with some of the reference software implementations as a proof of concept. If needed, the advancements in one area can be easily transported to another, thereby making it possible to exploit the results in the other domains. More considerations on the complexity of porting to other domains and the assessment of the possible benefits will be presented in the next deliverable.

In the next steps, the testbeds will be evaluated on an end-to-end basis, including the capabilities of the different functional elements and the monitoring, detection and mitigation actions, concentrating on the benefits brought by the developed mechanisms for improving management within SDN/NFV environments. Careful attention will be given not only to the immediate benefits of the testbeds in their specific domains (e.g. IMS, EPC, firewalls, etc.), but also to the complexity of porting the technological advancements to the other domains.


Glossary, Acronyms and Definitions

5G 5th generation mobile networks

ACL Access Control List

ADE Anomaly Detection Engine

ANN Artificial Neural Network

API Application Programming Interface

CSE CogNet Smart Engine

DDoS Distributed Denial of Service

DNS Domain Name System

DoS Denial of Service

DSE Distributed Security Enablement

ENISA European Union Agency for Network and Information Security

EPC Evolved Packet Core

HTTP Hypertext Transfer Protocol

ICMP Internet Control Message Protocol

ID3 Iterative Dichotomiser 3

ISP Internet Service Provider

LB Load Balancer

LSSVM Latent Structural Support Vector Machine

LSTM Long Short Term Memory

MANO NFV Management & Orchestration

MCDT Multiple Criteria Decision Table

ML Machine Learning

NAS Network Access Server

NF Network Function

NFV Network Function Virtualization

NFVM NFV Management

NFVO NFV Orchestrator

OFS OpenFlow Switch

OpenTSDB Open Timeseries Database

OPNFV Open Platform for NFV


P-CSCF Proxy - Call Session Control Function

RADIUS Remote Authentication Dial-In User Service

RFI Radio Frequency Interference

RMSE Root Mean Square Error

SDN Software Defined Networks

SECaaS Security as a Service

SFC Service Function Chain

SFP Service Function Path

SLA Service Level Agreement

SLO Service Level Objective

SP Service Providers

SPAM Unsolicited email

SQL Structured Query Language

SYN Synchronize message to establish TCP connection

TLS Transport Layer Security

UE User Equipment

VM Virtual Machine

VNF Virtual Network Function

VNFC Virtual Network Function Controller

XML Extensible Markup Language

XSS Cross-site Scripting



[13] Rezende Fernandes Eraldo , Nogueira dos Santos C ı́cero, and Luiz Milidiu ́ Ruy, "Latent structure perceptron with feature induction for unrestricted coreference resolution," in Joint Conference on EMNLP and CoNLL - Shared Task, 2012, pp. 41-48.