Open Infrastructure Summit 2019

Demonstrating At Scale Monitoring Of OpenStack Cloud Using Prometheus

Anandeep Pannu
Pradeep Kilambi

OpenStack Cloud Using Prometheus Demonstrating At Scale Monitoring Of … · 2019-05-13 · Demonstrating At Scale Monitoring Of OpenStack Cloud Using Prometheus ... write_kafka write_prometheus


Definitions


Implications for Open Infrastructure


Critical Monitoring Features


● Portability across different footprints
● HA, scaling, persistence available for free
● Re-use platform capabilities, e.g. Prometheus
● Users integrate for capabilities they want
● Stringent SLAs can be met
● Plug-in different OSS components with the same API
● For each API, SLAs achieved can be optimized
  ○ E.g. fault management uses the message bus directly
● Metrics meta-data and declarative metrics for every component, so metrics can be incorporated automatically
● Data sensing, collection and processing
  ○ Some or all of it can be processed at the Edge
● Centralized access to reports, alerts
● Integration with Analytics


Service Assurance Framework Architecture


Architecture Overview

On-site infrastructure platform


[Architecture diagram. Labels: Dispatch Routing Message Distribution Bus (AMQP 1.0); metrics and events from kernel, net, cpu, mem, hardware, syslog, /proc, pid and VMs; Application Components (VM, Container); Controller, Compute, Ceph, RHEV, OpenShift Nodes (All Infrastructure Nodes); 3rd Party Integrations; Prometheus Operator; MGMT Cluster APIs; Prometheus-based K8S Monitoring]


● Collectd container -- Host / VM metrics collection framework
  ○ Collectd 5.8 with additional OPNFV Barometer specific plugins not yet in collectd project
    ● Intel RDT, Intel PMU, IPMI
    ● AMQP1.0 client plugin
    ● Procevent -- Process state changes
    ● Sysevent -- Match syslog for critical errors
    ● Connectivity -- Fast detection of interface link status changes
  ○ Integrated as part of TripleO (OSP Director)


[Diagram labels: write_syslog, write_kafka, write_prometheus, amqp_09, amqp1]


AMQ 7 Interconnect - Native AMQP 1.0 Message Router

● Large Scale Message Networks
  ○ Offers shortest path (least cost) message routing
  ○ Used without broker
  ○ High Availability through redundant path topology and re-route (not clustering)
  ○ Automatic recovery from network partitioning failures
  ○ Reliable delivery without requiring storage
● QDR Router Functionality
  ○ Apache Qpid Dispatch Router (QDR)
  ○ Dynamically learn addresses of messaging endpoints
  ○ Stateless - no message queuing, end-to-end transfer

[Diagram labels: Clients; Server A; Server B; Server C]

High Throughput, Low Latency

Low Operational Costs


Prometheus Operator


Evolution


[Diagram: a Central Site hosts the Prometheus Operator++ Cluster with Prometheus, Grafana, three QDR instances and three SG (Smart Gateway) instances. Remote Site(s), labelled Site 1, Site 2 ... Site 10, each show cntrl 1-3, ceph and compute nodes on AMQP OS Networks. The sites are joined by a Layer 3 Network to Remote Sites.]


DCN Use Case

[Diagram: Deployment Stack. The Primary Site (AZ0) shows the Undercloud + Container Registry, Controller Nodes, Compute Nodes (Local Ephemeral) and Ceph Cluster 0, with some components marked OPTIONAL. DCN Site 1 through DCN Site n (AZ1 through AZn) each show Compute Nodes (Local Ephemeral). The sites are connected over an L3 routed network.]


Configuration & Deployment


TripleO Integration Of Client Side Components

● Collectd and QDR profiles are integrated as part of TripleO

● Collectd and QDRs run as containers on the OpenStack nodes

● Configured via a Heat environment file

● Each node runs a Qpid Dispatch Router alongside the collectd agent

● Collectd is configured to send metrics and events to the Qpid Dispatch Router

● Relevant collectd plugins can be configured via the Heat template file
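
As an illustration of the last bullet, a Heat environment fragment along the following lines could enable additional collectd plugins. This is a sketch and not taken from the talk: it assumes the `CollectdExtraPlugins` parameter of tripleo-heat-templates is available in the release being deployed, and the plugin names are illustrative choices only.

```yaml
# Hypothetical example: the plugin list is an assumption, not from the slides.
parameter_defaults:
  # Plugins loaded in addition to the collectd service defaults
  CollectdExtraPlugins:
    - df     # filesystem usage
    - virt   # per-VM statistics gathered via libvirt
```

Such a file would be passed to the overcloud deploy command with an extra `-e` argument, next to the other environment files.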


TripleO Client side Configuration

environments/metrics-collectd-qdr.yaml

    ## This environment template to enable Service Assurance Client side bits
    resource_registry:
      OS::TripleO::Services::MetricsQdr: ../docker/services/metrics/qdr.yaml
      OS::TripleO::Services::Collectd: ../docker/services/metrics/collectd.yaml

    parameter_defaults:
      CollectdConnectionType: amqp1
      CollectdAmqpInstances:
        notify:
          notify: true
          format: JSON
          presettle: true
        telemetry:
          format: JSON
          presettle: false


TripleO Client side Configuration

params.yaml

    cat > params.yaml <<EOF
    ---
    parameter_defaults:
      CollectdConnectionType: amqp1
      CollectdAmqpInstances:
        telemetry:
          format: JSON
          presettle: true
      MetricsQdrConnectors:
        - host: qdr-white-normal-sa-telemetry.apps.dev7.nfvpe.site
          port: 443
          role: edge
          sslProfile: tlsProfile
          verifyHostname: false
    EOF


Client side Deployment

Using overcloud deploy with collectd & qdr configuration and environment templates

    cd ~/tripleo-heat-templates
    git checkout master
    cd ~
    cp overcloud-deploy.sh overcloud-deploy-overcloud.sh
    # Rewrite template paths in the copied script from /usr/share/openstack-* to /home/stack/*
    sed -i 's/usr\/share\/openstack-/home\/stack\//g' overcloud-deploy-overcloud.sh
    # Deploy with the collectd/QDR environment file and the site-specific parameters
    ./overcloud-deploy-overcloud.sh \
        -e /usr/share/openstack-tripleo-heat-templates/environments/metrics-collectd-qdr.yaml \
        -e /home/stack/params.yaml


There are 3 core components to the telemetry framework:

● Prometheus (and the AlertManager)

● Smart Gateway

● QPID Dispatch Router

Each of these components has a corresponding Operator that we'll use to spin up the various application components and objects.


To deploy the telemetry framework from the script, clone the telemetry-framework repo [1] so that it matches the path below, then change into its deploy directory and run the script:

cd ~/src/github.com/redhat-service-assurance/telemetry-framework/deploy/

./deploy.sh CREATE

[1] https://github.com/redhat-service-assurance/telemetry-framework


Deploying Service Assurance Framework

From Operator to Application

[Diagram labels: Operators; Custom Resources; Service Assurance Framework]


Demo


    avg_over_time(sa_collectd_cpu_percent{type=~"system|user"}[1m]) > 75 and avg_over_time(sa_collectd_cpu_percent{type=~"system|user"}[1m]) < 90

Critical CPU Usage Alert:

    avg_over_time(sa_collectd_cpu_percent{type=~"system|user"}[1m]) > 90
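
The two expressions above could be loaded into Prometheus as alerting rules roughly as follows. This is a minimal sketch in the standard Prometheus rule-file format; the group name, alert names and `severity` labels are assumptions made for illustration, and the demo may have defined its rules differently (for example through the Prometheus Operator).

```yaml
# Sketch only: group name, alert names and labels are hypothetical.
groups:
  - name: collectd-cpu
    rules:
      # Fires while the 1-minute average is between 75 and 90 percent
      - alert: WarningCPUUsage
        expr: >-
          avg_over_time(sa_collectd_cpu_percent{type=~"system|user"}[1m]) > 75
          and
          avg_over_time(sa_collectd_cpu_percent{type=~"system|user"}[1m]) < 90
        labels:
          severity: warning
      # Fires while the 1-minute average is above 90 percent
      - alert: CriticalCPUUsage
        expr: avg_over_time(sa_collectd_cpu_percent{type=~"system|user"}[1m]) > 90
        labels:
          severity: critical
```

With the thresholds as written on the slide, a value of exactly 90 matches neither rule, since both comparisons are strict.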


Architecture Demo Service Assurance framework


● https://telemetry-framework.readthedocs.io/en/master/

● https://quay.io/repository/redhat-service-assurance/smart-gateway-operator?tab=info

● https://github.com/redhat-service-assurance


[Diagram labels: two Targets exposing /Metrics; Prometheus Server; HTTP; PromQL; Visualize]
