Self-Healing and Resilience in Future 5G Cognitive Autonomous Networks
26-28 November
Santa Fe, Argentina
J. Ali-Tolppa, S. Kocsis, B. Schultz, L. Bodrog, M. Kajo
Nokia Bell Labs
Resilience
• “An ability to recover from or adjust easily to misfortune or change” Merriam-Webster Dictionary
Robustness
• “Capability of performing without failure under a wide range of conditions” Merriam-Webster Dictionary
Why is resiliency important in 5G?
• 5G is by nature dynamic and complex
→Unforeseen circumstances are bound to happen
• Use cases requiring ultra-high reliability (URLLC)
Robustness (redundancy etc.) alone is no longer enough!
How to design for resilience?
• Monitor and adapt
• Decoupling, modularity
Focus
Common core principles
Self-Healing in Radio Access Networks
Detecting Anomalies without Labelled Training Data
Which are anomalous?
Example 1 Example 2 Example 3 Example 4
Meaningless question Red Green
Anomaly Detection
Feature selection
Relevant feature:
Color Shape Color and shape
Radio Access Network Anomaly Detection
Feature and context selection
• Input features typically include Performance Management (PM) Key Performance Indicators (KPIs) and Fault Management (FM) alarms, but additional inputs, e.g. log analysis, can be used as well
• Is the whole input space profiled, including cross-correlations, or only selected projections of it (single KPIs, selected KPI pairs etc.)?
• The context needs to be decided, e.g. whether the profiling is done per network function or per group of network functions, with hourly or diurnal profiles for network-traffic-dependent KPIs etc.
• In our work we used PM data only and created diurnal profiles for traffic-dependent KPIs and cross-correlations for selected KPI pairs.
Radio Access Network Anomaly Detection
Simple time-context dependent profiling of a time series
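A minimal sketch of time-context dependent profiling, assuming the profile is a per-hour mean and standard deviation and the anomaly level is the distance from that mean in standard deviations. The slides do not state the actual profile model; the function names and sample values below are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def build_hourly_profile(samples):
    """samples: iterable of (hour_of_day, kpi_value) -> {hour: (mean, std)}."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {hour: (mean(vals), pstdev(vals)) for hour, vals in by_hour.items()}

def anomaly_level(profile, hour, value, min_std=1e-9):
    """Distance from the profile mean of that hour, in standard deviations."""
    mu, sigma = profile[hour]
    return abs(value - mu) / max(sigma, min_std)

# Hypothetical traffic-dependent KPI: low at 03:00, high at 18:00, four days of data.
training = [(3, 10.0), (3, 12.0), (3, 11.0), (3, 11.0),
            (18, 95.0), (18, 105.0), (18, 100.0), (18, 100.0)]
profile = build_hourly_profile(training)

normal_evening = anomaly_level(profile, 18, 102.0)  # close to the 18:00 profile
odd_night = anomaly_level(profile, 3, 60.0)         # far from the 03:00 profile
```

The same KPI value is judged against the profile of its own hour, so a level that is ordinary in the evening can still stand out at night.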
Radio Access Network Anomaly Detection
Cross-correlation profiling with clustering
• First, a clustering algorithm is applied; it omits the most probable outliers to clean up the data
• Correlation is modelled only inside the clusters
• Can also model non-normal multivariate distributions
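The steps above can be sketched for one KPI pair as follows. The slides do not name the clustering algorithm or the correlation model, so this sketch swaps in plain k-means with distance-based trimming and a per-cluster least-squares line. All function names, the starting centroids and the data are hypothetical.

```python
import math
from statistics import mean, pstdev

def assign(points, centroids):
    """Group points by their nearest centroid."""
    groups = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        groups[nearest].append(p)
    return groups

def kmeans(points, centroids, iterations=10):
    """Plain k-means from caller-supplied starting centroids."""
    for _ in range(iterations):
        groups = assign(points, centroids)
        centroids = [(mean(x for x, _ in g), mean(y for _, y in g)) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids

def trim(group, centroid, keep=0.85):
    """Omit the points farthest from the centroid (the most probable outliers)."""
    ranked = sorted(group, key=lambda p: math.dist(p, centroid))
    return ranked[:max(2, round(len(ranked) * keep))]

def fit_line(group):
    """Least-squares line y = slope * x + intercept, plus the residual spread."""
    mx, my = mean(x for x, _ in group), mean(y for _, y in group)
    sxx = sum((x - mx) ** 2 for x, _ in group)
    slope = sum((x - mx) * (y - my) for x, y in group) / sxx
    intercept = my - slope * mx
    sigma = pstdev(y - (slope * x + intercept) for x, y in group)
    return slope, intercept, max(sigma, 1e-9)

def build_profile(points, start_centroids):
    """Cluster, trim outliers, then model the correlation inside each cluster."""
    centroids = kmeans(points, start_centroids)
    groups = assign(points, centroids)
    return centroids, [fit_line(trim(g, c)) for g, c in zip(groups, centroids)]

def anomaly_level(profile, point):
    """Residual from the nearest cluster's line, in residual standard deviations."""
    centroids, models = profile
    i = min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))
    slope, intercept, sigma = models[i]
    return abs(point[1] - (slope * point[0] + intercept)) / sigma

# Hypothetical KPI pair with two operating regimes and one outlier in each.
low = [(1, 2.1), (2, 3.9), (3, 6.1), (4, 7.9), (5, 10.1), (6, 11.9), (3, 30.0)]
high = [(51, 75.6), (52, 75.9), (53, 76.6), (54, 76.9), (55, 77.6), (56, 77.9), (54, 50.0)]
profile = build_profile(low + high, start_centroids=[(0.0, 0.0), (60.0, 80.0)])

consistent = anomaly_level(profile, (4, 8.0))  # follows the low-regime relation
broken = anomaly_level(profile, (5, 3.0))      # each value is in range, the pair is not
```

Fitting one line per cluster lets the pair follow a different relation in each regime, which a single global correlation would not capture.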
Diagnosis
Anomaly event detection and diagnosis
• Anomalous timeframes are detected by applying the DBSCAN algorithm to the anomaly levels of selected features against their profiles
• By aggregating the selected feature (KPI) values in the anomaly event timeframe, the event is represented as an anomaly pattern
– The diagnosis feature set can be, and often is, different from the one used in the detection!
• The root causes of the detected anomaly patterns are diagnosed against a diagnosis knowledge base
[Figure: KPI1, KPI2 and KPI3 plotted over time with an anomalous timeframe marked; the average value of each KPI within the timeframe forms the anomaly pattern (KPI1, KPI2, KPI3).]
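A minimal sketch of the detection and diagnosis steps, under stated assumptions: DBSCAN is run on the time indices of samples whose anomaly level exceeds a threshold, the pattern is the per-KPI average over the timeframe, and diagnosis picks the nearest knowledge base entry by Euclidean distance. The slides specify none of these details; the threshold, parameters, KPI values and knowledge base patterns are hypothetical.

```python
import math
from statistics import mean

def dbscan_1d(values, eps, min_pts):
    """Minimal DBSCAN for unique 1-D values; noise points are dropped."""
    values = sorted(values)
    neighbours = {v: [u for u in values if abs(u - v) <= eps] for v in values}
    core = {v for v in values if len(neighbours[v]) >= min_pts}
    clusters, visited = [], set()
    for v in values:
        if v in visited or v not in core:
            continue
        cluster, frontier = set(), [v]
        while frontier:
            cur = frontier.pop()
            if cur in cluster or cur in visited:
                continue
            cluster.add(cur)
            if cur in core:  # only core points expand the cluster
                frontier.extend(neighbours[cur])
        visited |= cluster
        clusters.append(sorted(cluster))
    return clusters

def find_timeframes(levels, threshold, eps, min_pts):
    """Cluster the time indices of anomalous samples into (start, end) timeframes."""
    anomalous = [t for t, level in enumerate(levels) if level > threshold]
    return [(c[0], c[-1]) for c in dbscan_1d(anomalous, eps, min_pts)]

def anomaly_pattern(kpis, timeframe):
    """Average each diagnosis KPI over the anomalous timeframe."""
    start, end = timeframe
    return {name: mean(series[start:end + 1]) for name, series in kpis.items()}

def diagnose(pattern, knowledge_base):
    """Return the root cause whose stored pattern is closest to the observed one."""
    def distance(known):
        return math.sqrt(sum((pattern[k] - known[k]) ** 2 for k in pattern))
    return min(knowledge_base, key=lambda cause: distance(knowledge_base[cause]))

# Hypothetical anomaly levels per time step and diagnosis KPI series.
levels = [0.2, 0.5, 4.1, 0.3, 0.4, 3.8, 4.5, 5.0, 4.2, 0.6, 0.3, 0.1]
kpis = {
    "KPI1": [1.0] * 5 + [5.0] * 4 + [1.0] * 3,
    "KPI2": [2.0] * 12,
    "KPI3": [3.0] * 5 + [0.0] * 4 + [3.0] * 3,
}
knowledge_base = {  # hypothetical stored patterns
    "radio attenuation": {"KPI1": 5.0, "KPI2": 2.0, "KPI3": 0.5},
    "backhaul misconfiguration": {"KPI1": 1.0, "KPI2": 6.0, "KPI3": 3.0},
}

timeframes = find_timeframes(levels, threshold=3.0, eps=1, min_pts=3)
pattern = anomaly_pattern(kpis, timeframes[0])
root_cause = diagnose(pattern, knowledge_base)
```

The isolated spike at time step 2 is treated as noise, while the sustained run at steps 5 to 8 becomes one anomaly event.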
Diagnosis
Active learning assisted diagnosis
a) A human operator provides the machine with their own interpretation of the data
– By attaching labels to anomaly points or clusters while considering information from step b)
b) The machine provides the operator with a structured view of the data
– By clustering the data points while taking into account information from step a)
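[Figure: loop between the machine and the operator. The machine restructures and passes a structured view of the data to the operator; the operator rethinks and passes an interpretation of the data back.]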
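The loop between steps a) and b) can be sketched as follows. This is an illustration only: the slides do not describe the clustering or query strategy, so the sketch assumes a greedy radius-based clustering and asks the operator about the largest unlabelled cluster first. `structure`, `active_learning`, `scripted_operator` and the data are hypothetical.

```python
import math

def structure(points, labelled, radius):
    """Step b): structured view that takes the labels given so far into account."""
    assigned, clusters = {}, []
    for p in points:
        near = [(math.dist(p, q), label) for q, label in labelled.items()
                if math.dist(p, q) <= radius]
        if near:  # close to a labelled example: inherit its label
            assigned[p] = min(near)[1]
            continue
        for cluster in clusters:  # otherwise join or start an unlabelled cluster
            if math.dist(p, cluster[0]) <= radius:
                cluster.append(p)
                break
        else:
            clusters.append([p])
    return assigned, clusters

def active_learning(points, ask_operator, radius, max_queries=10):
    """Alternate between structuring the data and asking the operator for labels."""
    labelled = {}  # labelled example -> label
    for _ in range(max_queries):
        _, clusters = structure(points, labelled, radius)
        if not clusters:
            break
        biggest = max(clusters, key=len)
        labelled[biggest[0]] = ask_operator(biggest)  # step a)
    return structure(points, labelled, radius)

# Hypothetical 2-D anomaly patterns and a scripted stand-in for the human operator.
patterns = [(5.0, 0.0), (5.2, 0.1), (4.9, 0.0), (1.0, 3.0), (1.1, 2.9)]
queries = []

def scripted_operator(cluster):
    queries.append(cluster)
    return "radio attenuation" if cluster[0][0] > 3 else "backhaul misconfiguration"

assigned, leftover = active_learning(patterns, scripted_operator, radius=1.0)
```

In this run the operator is asked twice, once per cluster, and the two labels propagate to all five patterns.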
Holistic Self-Healing
Across domains and management areas in mobile networks
Coordination is required between the self-healing actions of, for example:
• Network Management (NM): Management automation aggregated on a (Virtual) Network Function ((V)NF) level
• Quality of Experience (QoE) driven management: Optimizing the end-to-end customer experience at the application and individual subscriber level
• VNF and Service Orchestration
“In a complex system, improving the resilience of only one part or level of organization can sometimes (unintentionally) introduce fragility in another. To improve the resilience, it is often necessary to work in more than one domain and scale at a time.” A. Zolli, A. M. Healy, “Resilience – Why Things Bounce Back”
Knowledge Cloud
Transferring diagnosis knowledge
• Collecting the diagnosis knowledge base requires significant effort
• It would be desirable to be able to diagnose previously unforeseen problems
• Both issues could be mitigated by sharing diagnosis knowledge between self-healing function deployments
• However, translating, i.e. generalizing and re-applying, diagnosis knowledge from other deployments is a difficult problem
– Transfer learning methods
Demonstration in SON Experimental System
Evaluation Results
• Fault injection in a testbed
– Radio attenuation
– Backhaul misconfiguration
• Background traffic increased until the optimization functions can no longer remedy the problem
• Our solution detected and diagnosed both conditions before they led to a service degradation
Conclusion
• In 5G, networks are becoming ever more complex and dynamic
• At the same time, new use cases are requiring increased reliability
• We need intelligent resilient networks that can react to unforeseen problems and adapt to changes in their context
• A step in this direction is the self-healing method presented in this paper, based on anomaly detection and diagnosis
• We need methods to share knowledge and coordinate the self-healing actions across management domains, areas and deployments
– Standardized interfaces not only for sharing data, but also for sharing knowledge and machine learning models
Thank you