Open Infrastructure Summit 2019
Demonstrating At Scale Monitoring Of
OpenStack Cloud Using Prometheus
Anandeep [email protected]
Pradeep [email protected]
Definitions
Implications for Open Infrastructure
Critical Monitoring Features
● Portability across different footprints
● HA, scaling, persistence available for free
● Re-use platform capabilities, e.g. Prometheus
● Users integrate for capabilities they want
● Stringent SLAs can be met
● Plug-in different OSS components with the same API
● For each API, SLAs achieved can be optimized
  ○ E.g. fault management uses the message bus directly
● Metrics meta-data and declarative metrics for every component, so metrics can be incorporated automatically
● Data sensing, collection and processing
  ○ Either some or all processed at the edge
● Centralized access to reports, alerts
● Integration with Analytics
Service Assurance Framework Architecture
Architecture Overview
On-site infrastructure platform
[Diagram labels: Dispatch Routing Message Distribution Bus (AMQP 1.0); data sources: kernel, net, cpu, mem, hardware, syslog, /proc, pid; VMs; Metrics and Events; Application Components (VM, Container); Controller, Compute, Ceph, RHEV, OpenShift Nodes (All Infrastructure Nodes); 3rd Party Integrations; Prometheus Operator; MGMT Cluster APIs; Prometheus-based K8S Monitoring]
● Collectd container -- Host / VM metrics collection framework
  ○ Collectd 5.8 with additional OPNFV Barometer specific plugins not yet in collectd project
    ■ Intel RDT, Intel PMU, IPMI
    ■ AMQP1.0 client plugin
    ■ Procevent -- Process state changes
    ■ Sysevent -- Match syslog for critical errors
    ■ Connectivity -- Fast detection of interface link status changes
  ○ Integrated as part of TripleO (OSP Director)
[Diagram labels, output plugins: write_syslog, write_kafka, write_prometheus, amqp_09, amqp1]
AMQ 7 Interconnect - Native AMQP 1.0 Message Router
● Large Scale Message Networks
  ○ Offers shortest path (least cost) message routing
  ○ Used without broker
  ○ High Availability through redundant path topology and re-route (not clustering)
  ○ Automatic recovery from network partitioning failures
  ○ Reliable delivery without requiring storage
● QDR Router Functionality
  ○ Apache Qpid Dispatch Router (QDR)
  ○ Dynamically learn addresses of messaging endpoints
  ○ Stateless - no message queuing, end-to-end transfer
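The least-cost routing and re-route behaviour listed above can be illustrated with a small sketch. This is not how Qpid Dispatch Router is implemented; it is a plain Dijkstra search over a hypothetical three-router mesh (the router names and link costs are assumptions made up for illustration), showing how a redundant path takes over when a link is lost.

```python
import heapq


def least_cost_path(links, source, target):
    """Return (cost, path) for the cheapest route, or None if unreachable.

    links maps an undirected pair (a, b) to a positive link cost.
    """
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, link_cost in graph.get(node, []):
            if neighbour not in visited:
                heapq.heappush(queue, (cost + link_cost, neighbour, path + [neighbour]))
    return None


# Hypothetical mesh: two cheap hops via qdr-b, one expensive direct link.
links = {("qdr-a", "qdr-b"): 1, ("qdr-b", "qdr-c"): 1, ("qdr-a", "qdr-c"): 5}
primary = least_cost_path(links, "qdr-a", "qdr-c")

# Simulate losing the qdr-b <-> qdr-c link: traffic falls back to the direct path.
surviving = {pair: cost for pair, cost in links.items() if pair != ("qdr-b", "qdr-c")}
rerouted = least_cost_path(surviving, "qdr-a", "qdr-c")

print(primary)
print(rerouted)
```

No state has to be copied between routers for the fallback to work, which is the sense in which availability comes from topology rather than clustering.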
[Diagram: clients and Server A, Server B, Server C connected through the router network. Captions: High Throughput, Low Latency; Low Operational Costs]
Prometheus Operator
Evolution
[Diagram: Central Site running the Prometheus Operator++ Cluster (Prometheus, Grafana, three QDR instances, three SG instances), connected by a Layer 3 Network to Remote Sites (Site 1, Site 2, ... Site 10). Each site has an AMQP OS Network joining cntrl 1, cntrl 2, cntrl 3, ceph nodes and compute nodes.]
DCN Use Case
[Diagram: L3 Routed deployment. Primary Site: Undercloud + Container Registry, Controller Nodes, AZ0 with Compute Nodes (Local Ephemeral), Ceph Cluster 0; two elements are marked OPTIONAL. DCN Site 1, DCN Site 2, DCN Site 3, DCN Site 4, ... DCN Site n: one availability zone each (AZ1, AZ2, AZ3, AZ4, ... AZn) with Compute Nodes (Local Ephemeral).]
Deployment Stack
Configuration & Deployment
● Collectd and QDR profiles are integrated as part of TripleO
● Collectd and QDR run as containers on the OpenStack nodes
● Configured via a Heat environment file
● Each node runs a Qpid Dispatch Router alongside the collectd agent
● Collectd is configured to send metrics and events to the local Qpid Dispatch Router
● Relevant collectd plugins can be configured via the Heat template file
TripleO Integration Of client side components
    ## This environment template to enable Service Assurance Client side bits
    resource_registry:
      OS::TripleO::Services::MetricsQdr: ../docker/services/metrics/qdr.yaml
      OS::TripleO::Services::Collectd: ../docker/services/metrics/collectd.yaml

    parameter_defaults:
      CollectdConnectionType: amqp1
      CollectdAmqpInstances:
        notify:
          notify: true
          format: JSON
          presettle: true
        telemetry:
          format: JSON
          presettle: false

TripleO Client side Configuration: environments/metrics-collectd-qdr.yaml
    cat > params.yaml <<EOF
    ---
    parameter_defaults:
      CollectdConnectionType: amqp1
      CollectdAmqpInstances:
        telemetry:
          format: JSON
          presettle: true
      MetricsQdrConnectors:
        - host: qdr-white-normal-sa-telemetry.apps.dev7.nfvpe.site
          port: 443
          role: edge
          sslProfile: tlsProfile
          verifyHostname: false
    EOF

TripleO Client side Configuration: params.yaml
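Both files set `CollectdAmqpInstances`, so it helps to see what the combined configuration looks like. The sketch below assumes Heat's default behaviour, where a parameter in a later environment file replaces the whole value from an earlier one rather than being deep-merged, and it assumes `params.yaml` is passed after the environment file. The helper name `merge_parameter_defaults` is hypothetical; it only models that rule with plain dictionaries.

```python
def merge_parameter_defaults(*environments):
    """Model of shallow override: later files replace whole parameter values."""
    merged = {}
    for env in environments:
        merged.update(env.get("parameter_defaults", {}))
    return merged


# Values copied from environments/metrics-collectd-qdr.yaml
base_env = {
    "parameter_defaults": {
        "CollectdConnectionType": "amqp1",
        "CollectdAmqpInstances": {
            "notify": {"notify": True, "format": "JSON", "presettle": True},
            "telemetry": {"format": "JSON", "presettle": False},
        },
    }
}

# Values copied from params.yaml
site_params = {
    "parameter_defaults": {
        "CollectdConnectionType": "amqp1",
        "CollectdAmqpInstances": {
            "telemetry": {"format": "JSON", "presettle": True},
        },
        "MetricsQdrConnectors": [
            {
                "host": "qdr-white-normal-sa-telemetry.apps.dev7.nfvpe.site",
                "port": 443,
                "role": "edge",
                "sslProfile": "tlsProfile",
                "verifyHostname": False,
            }
        ],
    }
}

effective = merge_parameter_defaults(base_env, site_params)
print(sorted(effective["CollectdAmqpInstances"]))
```

Under that assumption only the `telemetry` instance remains, with `presettle: true`; if the `notify` instance is also wanted, it would have to be repeated in `params.yaml`.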
    cd ~/tripleo-heat-templates
    git checkout master
    cd ~
    cp overcloud-deploy.sh overcloud-deploy-overcloud.sh
    sed -i 's/usr\/share\/openstack-/home\/stack\//g' overcloud-deploy-overcloud.sh
    ./overcloud-deploy-overcloud.sh -e /usr/share/openstack-tripleo-heat-templates/environments/metrics-collectd-qdr.yaml -e /home/stack/params.yaml

Client side Deployment
Using overcloud deploy with collectd & qdr configuration and environment templates
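The `sed` step is the least obvious part of the sequence: it points the copied deploy script at the templates checked out under the home directory instead of the packaged ones. A minimal sketch of the same substitution follows. The sample script line is hypothetical, since the contents of `overcloud-deploy.sh` are not shown on the slide.

```python
import re


def rewrite_template_paths(script_text):
    """Apply the same substitution as the sed expression, on every match."""
    return re.sub(r"usr/share/openstack-", "home/stack/", script_text)


# Hypothetical line from overcloud-deploy.sh
sample = "openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates"
rewritten = rewrite_template_paths(sample)
print(rewritten)
```

Only the script body is rewritten; the `-e` paths typed on the command line are passed through as written.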
There are 3 core components to the telemetry framework:
● Prometheus (and the AlertManager)
● Smart Gateway
● QPID Dispatch Router
Each of these components has a corresponding Operator that we'll use to spin up the various application components and objects.
To deploy the telemetry framework from the script, clone the telemetry-framework repo[1], change into the following directory and run the deploy script.
cd ~/src/github.com/redhat-service-assurance/telemetry-framework/deploy/
./deploy.sh CREATE
[1] https://github.com/redhat-service-assurance/telemetry-framework
Operators Custom Resources
Service Assurance Framework
Deploying Service Assurance Framework
From Operator to Application
Demo
avg_over_time(sa_collectd_cpu_percent{type=~"system|user"}[1m]) > 75 and avg_over_time(sa_collectd_cpu_percent{type=~"system|user"}[1m]) < 90
Critical CPU Usage Alert:
avg_over_time(sa_collectd_cpu_percent{type=~"system|user"}[1m]) > 90
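The two expressions split CPU usage into bands based on a one-minute average. The sketch below mirrors that logic outside Prometheus, assuming the samples passed in are the values that fell inside the window. The band name "warning" is an assumption, since the slide names only the critical alert, and the function names are hypothetical.

```python
def avg_over_window(samples):
    """Arithmetic mean of the samples in the window, like avg_over_time."""
    return sum(samples) / len(samples)


def classify_cpu(samples, warn=75.0, crit=90.0):
    """Return the band the window average falls into.

    Comparisons are strict, matching the > and < in the expressions,
    so an average of exactly 90 fires neither rule.
    """
    if not samples:
        return "no data"
    average = avg_over_window(samples)
    if average > crit:
        return "critical"
    if warn < average < crit:
        return "warning"
    return "ok"


print(classify_cpu([80.0, 85.0, 78.0]))
print(classify_cpu([95.0, 92.0, 97.0]))
```

Averaging over the window keeps a single short spike from firing an alert.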
Architecture Demo Service Assurance framework
● https://telemetry-framework.readthedocs.io/en/master/
● https://quay.io/repository/redhat-service-assurance/smart-gateway-operator?tab=info
● https://github.com/redhat-service-assurance
[Diagram: Prometheus Server collects from each Target's /metrics endpoint over HTTP; PromQL queries over HTTP feed the Visualize step.]
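What a target returns from /metrics is plain text, one sample per line. The sketch below renders a sample in the Prometheus text exposition format. The metric name is taken from the alert expressions in this deck; the helper name `render_sample` and the sample values are assumptions for illustration.

```python
def escape_label_value(value):
    """Escape backslash, double quote and newline inside a label value."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name, labels, value):
    """Render one sample line: name{label="value",...} value"""
    if labels:
        body = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        return f"{name}{{{body}}} {value}"
    return f"{name} {value}"


# Hypothetical readings for two CPU states
lines = [
    render_sample("sa_collectd_cpu_percent", {"type": "user"}, 12.5),
    render_sample("sa_collectd_cpu_percent", {"type": "system"}, 3.0),
]
print("\n".join(lines))
```

The `type` label is what the `type=~"system|user"` matcher in a PromQL query selects on.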