Faster, Higher, Stronger: Accelerating Fault Management to Next Level

Page 1: Faster, Higher, Stronger - OPNFV

Faster, Higher, Stronger: Accelerating Fault Management to Next Level

Page 2: Faster, Higher, Stronger - OPNFV

Speakers

Carlos Gonçalves

Software Specialist on the 5G Networks team at NEC Laboratories Europe in Heidelberg, Germany.

He works in the areas of Network Functions Virtualization and Carrier-Cloud Operation & Management.

Yujun Zhang

NFV System Engineer at ZTE Corporation.

He is the current PTL of QTIP in OPNFV and the creator of MitmStack in OpenStack.

His main interests are performance testing, analysis and tuning.

Page 3: Faster, Higher, Stronger - OPNFV

Doctor project introduction

Doctor is a fault management and maintenance project that develops and realizes the corresponding implementation for the OPNFV reference platform.

● Goals
  ○ Build a fault management and maintenance framework
    ■ High availability of Network Services
    ■ Immediate notification of unavailability
  ○ Requirement survey
  ○ Development of missing features

● Scope: NFVI, VIM

Page 4: Faster, Higher, Stronger - OPNFV

Role of QTIP in the collaboration

QTIP is the project for "Platform Performance Benchmarking"

● Reveal details behind a simple indicator
● Benchmarking of various testing environments and conditions

Page 5: Faster, Higher, Stronger - OPNFV

Expected to learn

● How you can enable fast fault mitigation from a rich set of monitoring data sources

● How to speed up delivery of NFVI failure events to the user

● How to leverage a performance profiler to find the bottleneck

Page 6: Faster, Higher, Stronger - OPNFV

A “Strong” unbreakable mobile call

Page 7: Faster, Higher, Stronger - OPNFV

Faster! How?

Page 8: Faster, Higher, Stronger - OPNFV

Notification strategies: conservative

Page 9: Faster, Higher, Stronger - OPNFV

Notification strategies: shortcut

Page 10: Faster, Higher, Stronger - OPNFV

Notification strategies: pros and cons

Conservative
+ Cloud resource states are always up-to-date
- Takes longer to report the alarm out to consumers

Shortcut
+ Faster notification to consumer
- Cloud resource states could still be out-of-sync by the time the consumer processes the alarm notification

Consumer: User-side Manager; consumer of the interfaces produced by the VIM; VNFM, NFV-O or Orchestrator in ETSI NFV terminology
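The difference between the two strategies comes down to the order of two steps. The following is a minimal sketch of that ordering only; `handle_fault`, the two inner helpers and the host name `compute-1` are hypothetical names chosen for illustration and are not part of any OpenStack or Doctor API.

```python
# Hypothetical sketch: the two strategies differ only in whether the cloud
# resource state is updated before or after the consumer is notified.

def handle_fault(host, strategy, log):
    """Append the actions taken for a failed host to `log`, in order."""

    def update_resource_state():
        log.append(("state_updated", host))

    def notify_consumer():
        log.append(("consumer_notified", host))

    if strategy == "conservative":
        # State is consistent before the consumer hears about the fault.
        update_resource_state()
        notify_consumer()
    elif strategy == "shortcut":
        # Consumer hears first; state may still be stale when it reacts.
        notify_consumer()
        update_resource_state()
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return log


conservative = handle_fault("compute-1", "conservative", [])
shortcut = handle_fault("compute-1", "shortcut", [])
```

In the shortcut ordering, a consumer that queries the resource state immediately after receiving the alarm may still read the old value, which is the out-of-sync risk listed above.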

Page 11: Faster, Higher, Stronger - OPNFV

Notification times comparison (1/3)

OpenStack Ocata (DevStack out of the box); 1x Controller, 1x Compute

Page 12: Faster, Higher, Stronger - OPNFV

Notification times comparison (2/3)

Same deployment + Congress w/ notification capabilities (draft) & parallel execution driver support (cherry-picked from master)

Page 13: Faster, Higher, Stronger - OPNFV

Notification times comparison (3/3)

- Sample outperforms Congress out of the box
- Congress is much richer in features, supporting dynamic user-defined policies and execution actions on most OpenStack cloud resources

Page 14: Faster, Higher, Stronger - OPNFV

Issues and challenges

● Passed on Pod-A, but poor result on Pod-B
  ○ Why such a difference?

● Performance degradation when scaling up to more servers
  ○ What is the bottleneck?

● Distributed services
  ○ How to collect data from different nodes?

Page 15: Faster, Higher, Stronger - OPNFV

Performance Profiler

Get more details behind Pass/Failure

Page 16: Faster, Higher, Stronger - OPNFV

Simple diagnostic

Page 17: Faster, Higher, Stronger - OPNFV

Want more details?

● Check log files
● Check debugging messages
● ...

Page 18: Faster, Higher, Stronger - OPNFV

High-technology equipment helps doctors in the real world

Page 19: Faster, Higher, Stronger - OPNFV

Example in Chrome DevTools: Measuring Resource Loading Times

What does a profiler do

Page 20: Faster, Higher, Stronger - OPNFV

Craft a PoC

Page 21: Faster, Higher, Stronger - OPNFV

Profiler PoC

Page 22: Faster, Higher, Stronger - OPNFV

Still not enough

Inspiration from an OpenStack Summit presentation

Page 23: Faster, Higher, Stronger - OPNFV

Now we know why

What’s behind `nova reset-state`

Page 24: Faster, Higher, Stronger - OPNFV

How osprofiler works

The implementation is quite simple. The profiler has one stack that contains the ids of all trace points. E.g.:

from osprofiler import profiler

profiler.init("SECRET_HMAC_KEY")  # placeholder key; initializes the profiler

profiler.start("parent_point") # trace_stack.push(<new_uuid>)
                               # send to collector -> trace_stack[-2:]

profiler.start("parent_point") # trace_stack.push(<new_uuid>)
                               # send to collector -> trace_stack[-2:]

profiler.stop()                # send to collector -> trace_stack[-2:]
                               # trace_stack.pop()

profiler.stop()                # send to collector -> trace_stack[-2:]
                               # trace_stack.pop()
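The stack behaviour described in the comments above can be reproduced with a toy model. This is a sketch for illustration only, not osprofiler's actual code: the class `ToyProfiler`, its `collected` list (standing in for the collector backend) and the point name `child_point` are all assumptions.

```python
import uuid


class ToyProfiler:
    """Toy model of the trace stack described above (not osprofiler itself)."""

    def __init__(self, base_id):
        # Seed the stack with a base id so the first point has a parent.
        self.trace_stack = [base_id]
        self._names = []
        self.collected = []  # stands in for the collector backend

    def _send(self, phase):
        # The last two ids on the stack are (parent, current trace point).
        parent_id, trace_id = self.trace_stack[-2:]
        self.collected.append({
            "name": f"{self._names[-1]}-{phase}",
            "parent_id": parent_id,
            "trace_id": trace_id,
        })

    def start(self, name):
        self.trace_stack.append(str(uuid.uuid4()))
        self._names.append(name)
        self._send("start")

    def stop(self):
        self._send("stop")
        self.trace_stack.pop()
        self._names.pop()


p = ToyProfiler(base_id="base")
p.start("parent_point")
p.start("child_point")
p.stop()
p.stop()
```

Because every message carries the pair `trace_stack[-2:]`, a collector can rebuild the call tree from the parent and trace ids alone, without knowing the nesting in advance.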

Page 25: Faster, Higher, Stronger - OPNFV

Supported vs Needed

osprofiler doctor

CINDER

HEAT

KEYSTONE

NOVA

NEUTRON

GLANCE

TROVE

SENLIN

MAGNUM

CEILOMETER

VITRAGE

CONGRESS

AODH

Page 26: Faster, Higher, Stronger - OPNFV

Recommended to track by default

All HTTP calls - helps to get information about which HTTP requests were made, the duration of calls (latency of the service), and the projects involved in the request.

All RPC calls - helps to understand the duration of the parts of a request handled by different services in one project. This information is essential to understand which service produces the bottleneck.

All DB API calls - in some cases a slow DB query can produce the bottleneck, so it is quite useful to track how much time a request spends in the DB layer.

All driver calls - in case of nova, cinder and others we have vendor drivers. Duration

All SQL requests (turned off by default, because they produce a lot of traffic)
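To illustrate why per-layer durations help locate a bottleneck, here is a minimal standard-library sketch. It does not use osprofiler; the `tracked` decorator, the layer names and the simulated sleep times are all assumptions made up for the example.

```python
import time
from collections import defaultdict
from functools import wraps

durations = defaultdict(float)  # seconds accumulated per layer


def tracked(layer):
    """Accumulate wall-clock time spent in calls tagged with `layer`."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            started = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                durations[layer] += time.perf_counter() - started
        return wrapper
    return decorator


@tracked("rpc")
def fake_rpc_call():
    time.sleep(0.001)  # hypothetical fast RPC round trip


@tracked("db")
def fake_db_query():
    time.sleep(0.05)  # hypothetical slow query
    return "row"


def handle_request():
    # The two layers are called one after the other, so their times
    # do not overlap.
    fake_rpc_call()
    return fake_db_query()


result = handle_request()
bottleneck = max(durations, key=durations.get)
```

With the simulated delays above, the DB layer dominates the request time, which is the kind of finding the tracking points are meant to surface.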

Page 27: Faster, Higher, Stronger - OPNFV

Challenges in doctor use case

Doctor use case

● Composed of several consecutive steps

● Relies on events for fast notification

● Starts at the monitor and ends at the consumer

● Multi-threaded in the inspector

ASYNCHRONOUS

OSProfiler limitation

● Designed for profiling ONE request

● Event notification not tracked

● Must start and end in same thread

● Multi-threading is not supported

SYNCHRONOUS
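The same-thread limitation can be illustrated with a small sketch. It assumes the trace state is kept in thread-local storage, which is one common way such a restriction arises; the functions `start`, `current_stack` and `inspector_worker` and the point name are hypothetical and do not reproduce osprofiler's code.

```python
import threading

_local = threading.local()  # one trace stack per thread (assumption)


def start(name):
    """Push a trace point onto the calling thread's own stack."""
    if not hasattr(_local, "stack"):
        _local.stack = []
    _local.stack.append(name)


def current_stack():
    """Return a copy of the calling thread's stack (empty if none)."""
    return list(getattr(_local, "stack", []))


seen_in_worker = []


def inspector_worker():
    # Runs in a separate thread: the trace point started in the main
    # thread is not visible here.
    seen_in_worker.append(current_stack())


start("monitor-detects-fault")
worker = threading.Thread(target=inspector_worker)
worker.start()
worker.join()
seen_in_main = current_stack()
```

Under this assumption, a trace started when the monitor detects a fault is lost as soon as the work is handed to another thread, unless the trace ids are passed along explicitly.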

Page 28: Faster, Higher, Stronger - OPNFV

Gaps identified in upstream

osprofiler feature

[ ] multi-thread support

No support for osprofiler in OpenStack services

[ ] alarming: aodh
[ ] inspector: vitrage
[ ] inspector: congress

Page 29: Faster, Higher, Stronger - OPNFV

Next in Euphrates

Page 30: Faster, Higher, Stronger - OPNFV

Roadmap: Doctor-QTIP collaboration

● [doctor] Integration of osprofiler in CI jobs
● [doctor] Propose changes to upstream to fill gaps
  ○ Osprofiler enhancement
  ○ Aodh support
  ○ Congress support
  ○ Vitrage support

● [qtip] Benchmarking of notification performance
  ○ Collector backend for profiler data
  ○ Dashboard for performance profile of last build

Page 31: Faster, Higher, Stronger - OPNFV

Questions?

https://wiki.opnfv.org/display/doctor

https://wiki.opnfv.org/display/qtip