Troubleshooting App Health and Performance with PCF Metrics 1.2

PCF Metrics – App Dev

Providing App Developers insight into app performance

Pieter Humphrey, Allen Duet

Gartner believes that more than 80% of all mission-critical IT service outages result from people and process errors and failures, and of those outages, more than 50% result from a lack of coordination between change, release and configuration management processes.

Four Steps to Optimize Configuration Management Process and Tools, by Ronni J. Colville, Doc #G00258557, Oct 2013

Modern infrastructure is constantly changing

Methodologies Deployment

Sparingly at designated times

Ready for prod at any time

Architecture Technologies Operations

App Server on Machine

Containers, Public / Private /

Hybrid Cloud

Monolithic App

Microservices / Composite app

Linear / Sequential

Agile, DevOps

CI / CD Pipelines

Many tools, ad hoc automation

Manage services, not servers

Rate of change is driving more outages

Outages often preventable using automation

2015

• Facebook: 1 hour, Jan 26th. Config / app / net failures
• Apple App Store: 11 hours, March 11th. Internal DNS error
• NYSE, United, WSJ: 4 hr, 1.5 hr, 1 hr, July 8th. Software update, routing failure, server overload
• UltraDNS: 2.5 hours, Oct 15th. Configuration errors

Sources:
https://blog.thousandeyes.com/top-internet-outages-2015/
http://www.informationweek.com/cloud/9-spectacular-cloud-computing-fails/d/d-id/1321305?image_number=2
http://www.informationweek.com/cloud/9-spectacular-cloud-computing-fails/d/d-id/1321305?image_number=4
http://www.informationweek.com/cloud/9-spectacular-cloud-computing-fails/d/d-id/1321305?image_number=8

“25% of customers will abandon a web page that takes more than 4 seconds to load”

“47% of consumers expect a web page to load in < 2 seconds”

“Customers prefer a competitor's website if it is 250 ms faster”

“Increase revenue 1% for each 100ms improvement”

Sources: Gartner, Google, Amazon, Walmart

Speed and Availability Matters

Speed Performance and Human Perception

Delay time      User reaction
0-100 ms        Instant
100-300 ms      Feels sluggish
300-1000 ms     Machine is working...
1 second +      Mental context switch
10 seconds +    I'll come back later...

Stay under 250 ms to feel "fast".

Stay under 1000 ms to keep users' attention.

Source: Breaking the 1000 ms Mobile Barrier (Velocity), https://docs.google.com/presentation/d/1wAxB5DPN-rcelwbGO6lCOus_S1rP24LMqA8m1eXEDRo/present?slide=id.p19
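
The delay bands above can be turned into a small lookup, for example to label response times pulled from request logs. This is only a sketch: the function name and constants are made up for illustration, and it assumes that a delay sitting exactly on a boundary (for example 100 ms) belongs to the slower band, which the slide does not specify.

```python
from bisect import bisect_right

# Upper edges (in ms) of the first four bands from the slide's table.
BAND_LIMITS_MS = [100, 300, 1000, 10_000]

# User reactions from the slide, ordered from fastest to slowest band.
REACTIONS = [
    "Instant",
    "Feels sluggish",
    "Machine is working...",
    "Mental context switch",
    "I'll come back later...",
]


def user_reaction(delay_ms: float) -> str:
    """Map a response delay in milliseconds to the perceived reaction.

    Assumption: a delay exactly on a boundary falls into the slower band.
    """
    if delay_ms < 0:
        raise ValueError("delay_ms must be non-negative")
    return REACTIONS[bisect_right(BAND_LIMITS_MS, delay_ms)]


if __name__ == "__main__":
    for sample in (40, 250, 700, 1500, 12_000):
        print(f"{sample} ms: {user_reaction(sample)}")
```

A lookup like this could be applied to each request's response time to see what share of traffic stays in the "Instant" band.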

Changes to a single microservice or monolithic app can impact performance of downstream apps and services, or cause breakage

Troubleshooting apps and microservices is hard

Most platforms have:

• Disparate permissions on different apps

• Data silos across subsystems

• Trouble reconciling time series data

DEVELOPMENT

Multiple Languages, Microservices Support, Services Marketplace

Native, User Provided, Partner

OPERATIONS

Operating System, Cloud API, Container Orchestration

App Deployment & Management, Availability, Visibility & Administration

CI/CD Tools, ID, Security

Health, Metrics, Patching

Apps & Platform Dashboards

4 Levels of High Availability

1. App Instance Fail
2. Process Fail
3. VM Fail
4. Availability Zone Fail

Container Scheduler Handles Workloads

• 250,000 containers managed in a single environment

https://blog.pivotal.io/pivotal-cloud-foundry/products/250k-containers-in-production-a-real-test-for-the-real-world

• Dynamic load balancing

• Remediation and rebalance of workloads

Each Layer Upgradable with No Downtime

App Runtime*

File system mapping

Application

Linux host & kernel

Blue-Green deploy

Canary style deploy

* e.g. Embedded webserver, app configurations, JRE, agents for services packaged as buildpacks

Container

Our Charter

To provide App Devs with data points to assess overall solution performance and health

• Near real-time view

• Covers 80-90% of the problems

• One tool correlates events, logs, metrics

• Common set of facts for Dev+Ops

• Designed for PCF multi-tenancy

• Agentless, no install

• Enabled automatically for all applications

Immediate Integrated Automated
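
To make "one tool correlates events, logs, metrics" concrete, here is a rough sketch of time-window correlation: given records from the three sources, collect everything that happened close to a moment of interest and order it into one timeline. The `Record` type, the `correlate` function and the sample data are hypothetical and are not the PCF Metrics implementation or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Record:
    source: str          # "event", "log" or "metric" (hypothetical labels)
    timestamp: datetime
    detail: str


def correlate(records, anchor, window=timedelta(seconds=30)):
    """Return the records within `window` of `anchor`, oldest first."""
    nearby = [r for r in records if abs(r.timestamp - anchor) <= window]
    return sorted(nearby, key=lambda r: r.timestamp)


if __name__ == "__main__":
    # Made-up sample data, purely for illustration.
    crash = datetime(2016, 1, 1, 12, 0, 0)
    records = [
        Record("event", crash, "app instance crashed"),
        Record("log", crash - timedelta(seconds=5), "OutOfMemoryError"),
        Record("metric", crash - timedelta(seconds=10), "memory at 98%"),
        Record("log", crash - timedelta(minutes=10), "GET /health 200"),
    ]
    for r in correlate(records, anchor=crash):
        print(r.timestamp.time(), r.source, r.detail)
```

Putting all three sources on one shared timeline is what gives Dev and Ops a common set of facts to look at, rather than each team reading its own silo.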

Available Data

CF EVENTS

APP LOGS

APP METRICS

ROUTES

Select an app, watch streaming data

PCF Metrics v1.2.1

• 2 weeks of app log storage

• 2 weeks of detailed container and HTTP start stop metric storage

• App Log distribution histogram

• App Event UI improvements

• Fault tolerance on all storage services

• Testing and tuning for large ingestion loads
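
As a rough illustration of what a log distribution histogram over a two-week retention window involves, the sketch below buckets log timestamps into fixed-width intervals and ignores anything older than the retention period. The function name, the one-hour bucket width and the use of naive `datetime` values are assumptions for the example; the slides do not describe how PCF Metrics computes its histogram.

```python
from collections import Counter
from datetime import datetime, timedelta

RETENTION = timedelta(weeks=2)  # storage window stated on the slide


def log_histogram(timestamps, now, bucket=timedelta(hours=1)):
    """Count log lines per time bucket, skipping lines outside RETENTION.

    Assumption: timestamps are naive datetimes in the same time zone as `now`.
    """
    counts = Counter()
    for ts in timestamps:
        age = now - ts
        if timedelta(0) <= age <= RETENTION:
            # Floor the timestamp to the start of its bucket.
            index = (ts - datetime.min) // bucket
            counts[datetime.min + index * bucket] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    now = datetime(2016, 6, 15, 12, 0)
    sample = [
        now - timedelta(minutes=10),
        now - timedelta(minutes=20),
        now - timedelta(hours=3),
        now - timedelta(weeks=3),   # older than retention, dropped
    ]
    for start, count in log_histogram(sample, now).items():
        print(start, count)
```

A spike in one bucket relative to its neighbours is the kind of signal such a histogram is meant to surface.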

Data Correlation Demo

PCF Metrics 1.2 Architecture

Our Journey

PCF Metrics v1.0: Aggregate Container and HTTP metrics provided for Apps

PCF Metrics v1.1: Aggregate Container and HTTP metrics + App events and Logs (24 hour storage)

PCF Metrics v1.2.1: Aggregate Container and HTTP metrics + App events and Logs (2 weeks storage)

PCF Metrics v1.3: Aggregate Container and HTTP metrics + App events and Logs (2 weeks storage), TraceID capture and Trace Logs

v1.3+ App Developers

• Spring Boot actuator support

• Expanded event descriptions

• Additional Log sources *

• Data exposed as API

• Continued UX improvements