Production Readiness Strategies in an Automated World


Page 1: Production Readiness Strategies in an Automated World

Production Readiness Strategies in an Automated World

Page 2: Production Readiness Strategies in an Automated World

Sean Chittenden

Engineering, HashiCorp

@SeanChittenden

https://keybase.io/seanc

Page 3: Production Readiness Strategies in an Automated World

Dev to Prod

Page 4: Production Readiness Strategies in an Automated World

Background

Page 5: Production Readiness Strategies in an Automated World

Software Life Cycle

Page 6: Production Readiness Strategies in an Automated World

Idea!

Software Life Cycle


Page 8: Production Readiness Strategies in an Automated World

Software Life Cycle

[Figure: life-cycle timeline. Axes: Time, Prod. Region: R&D.]

1) Idea!

Page 9: Production Readiness Strategies in an Automated World

Software Life Cycle

[Figure: life-cycle timeline. Axes: Time, Prod. Region: R&D.]

1) Idea!

2) Production Ready

Page 13: Production Readiness Strategies in an Automated World

Software Life Cycle

[Figure: life-cycle timeline. Axes: Time, Readiness. Region: R&D.]

1) Idea!

2) Production Ready

2.9) "It’ll be time to wind this service down when ___ happens and ___ comes online."

3) End of Life

Page 14: Production Readiness Strategies in an Automated World

Software Life Cycle

[Figure: life-cycle timeline. Axes: Time, Production. Regions: R&D, "Production Supported". Marker: "Oops".]

1) Idea!

2) Production Ready

3) End of Life

Page 15: Production Readiness Strategies in an Automated World

Software Life Cycle

[Figure: life-cycle timeline. Axes: Time, Production. Regions: R&D, "Production Supported".]

1) Idea!

2) Production Ready

3) "Oops"

4) End of Life

Page 16: Production Readiness Strategies in an Automated World

Software Life Cycle

[Figure: life-cycle timeline. Axes: Time, Production. Regions: R&D, "Production Supported".]

1) Idea!

Forced to fix code or docs.

N) End of Life

Page 17: Production Readiness Strategies in an Automated World

Software Life Cycle

[Figure: life-cycle timeline. Axes: Time, Production. Regions: R&D, "Production Supported".]

1) Idea!

2) Production Ready

"Dragged feet to produce docs."

[3,M) "Oops"

N-1) "That’s it, we’ve had enough…"

N) End of Life

Page 18: Production Readiness Strategies in an Automated World

Software Life Cycle

[Figure: life-cycle timeline. Axes: Time, Production. Regions: R&D, "Production Supported".]

1) Idea!

2) Production Ready

[3,M) "Oops"

N-2) "That’s it, we’ve had enough…"

N-1) "Just support it until the next version is out"

N) End of Life

Page 19: Production Readiness Strategies in an Automated World

Operations in the "Real World"

Page 20: Production Readiness Strategies in an Automated World

Complexity Abounds

The Echo Service: Stateless HTTP Echo

$ go get github.com/hashicorp/http-echo
$ http-echo -text foo
$ curl http://127.0.0.1:5678/foo

Page 21: Production Readiness Strategies in an Automated World

Echo as a Service

Components:

• Echo Service

• Load Balancer

• "Hardware" / OS

• Metrics Agent

• Logs Management

• Reproducible Builds

$ cd $GOPATH/src/github.com/hashicorp/http-echo/
$ git checkout 87ee38c517094993932bd76b37af03980e8c4151
$ go build

Page 22: Production Readiness Strategies in an Automated World

Complexity In The Simple Case

Simple Example: The Echo Service

Minimum of 6x dimensions to be concerned about

No downstream services: only request + response

Page 23: Production Readiness Strategies in an Automated World

Echo as a Service

Dimensions of Work to measure:

• CPU

• RAM usage

• Network Usage

• TCP accept/connection rate

• Disk Capacity

• Disk IO (maybe?)

• Stability

• Request volume

• Request Latency

Page 24: Production Readiness Strategies in an Automated World

"Can't Escape the Signal, Mal"

The Echo Service: Stateless HTTP Echo

2016/11/18 03:29:58 Server is listening on :5678
2016/11/18 03:30:00 127.0.0.1:5678 127.0.0.1:61932 "GET / HTTP/1.1" 200 4 "curl/7.51.0" 15.94µs
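Even this one log line carries several of the dimensions listed earlier (request volume, status, response size, latency). A minimal sketch of pulling them out follows; the field layout is inferred only from the single sample line above, so treat the pattern and the function name `parse_access_line` as assumptions rather than a documented http-echo format.

```python
import re

# Layout inferred from the one sample line on the slide (an assumption).
LOG_LINE = re.compile(
    r'^(?P<date>\d{4}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) '
    r'(?P<server>\S+) (?P<client>\S+) '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<agent>[^"]*)" (?P<latency>[\d.]+)(?P<unit>ns|us|µs|μs|ms|s)$'
)

# Scale factors keyed by the unit prefix; any other prefix is treated as micro.
PREFIX_TO_SECONDS = {"": 1.0, "n": 1e-9, "m": 1e-3}


def parse_access_line(line):
    """Return request fields as a dict, or None for non-request lines."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    scale = PREFIX_TO_SECONDS.get(fields["unit"][:-1], 1e-6)
    return {
        "path": fields["path"],
        "status": int(fields["status"]),
        "bytes": int(fields["size"]),
        "latency_s": float(fields["latency"]) * scale,
    }


SAMPLE = ('2016/11/18 03:30:00 127.0.0.1:5678 127.0.0.1:61932 '
          '"GET / HTTP/1.1" 200 4 "curl/7.51.0" 15.94µs')
STARTUP = '2016/11/18 03:29:58 Server is listening on :5678'
```

Feeding every line of the log through `parse_access_line` and dropping the `None` results gives a stream that a metrics agent could aggregate into request rate and latency figures.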

Page 25: Production Readiness Strategies in an Automated World

Echo as a Service

Complexity Factor: ~10

Page 26: Production Readiness Strategies in an Automated World

Echo's Operational Concerns

Loss Aversion

• Uptime

• Secrets

• Planned Failure Modes: failure on a probability curve

• Server Uptime (e.g. OS or Hardware)

• Unplanned Failure Modes (e.g. DC or AZ fails)

Page 27: Production Readiness Strategies in an Automated World

Entropy and Failure: Best Friends

Page 28: Production Readiness Strategies in an Automated World

Echo's Operational Concerns

Loss Aversion

• Uptime

• Secrets

• Planned Failure Modes: failure on a probability curve

• Server Uptime (e.g. OS or Hardware)

• Unplanned Failure Modes (e.g. DC or AZ fails in an earthquake)

• Success Failure Modes

Randall A. Lewis and David H. Reiley. 2013. Down-to-the-minute effects of super bowl advertising on online search behavior. http://dx.doi.org/10.1145/2482540.2482600

Page 29: Production Readiness Strategies in an Automated World

Echo's Operational Concerns

Loss Aversion

• Uptime

• Secrets

• Planned Failure Modes: failure on a probability curve

• Server Uptime (e.g. OS or Hardware)

• Unplanned Failure Modes (e.g. DC or AZ fails)

• Success Failure Modes

• Known Architectural Limits

• Unknown Architectural Limits

Page 30: Production Readiness Strategies in an Automated World

Performance Spelunking

Exciting, but not very fun

Page 31: Production Readiness Strategies in an Automated World

Lurking Significant Details

Imagine a more complex service:

• an API server that fans out to ~20 downstream services

• Uses async scatter/gather to fan out requests

• Transient failures become the norm
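A rough sketch of why fan-out makes transient failure the norm. The downstream call is a stand-in (no real network), the service names and failure rates are hypothetical, and the probability helper assumes failures are independent, which real systems rarely guarantee.

```python
import asyncio
import random


async def call_downstream(name, fail_rate, rng):
    """Stand-in for a network call; fails with probability fail_rate."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    if rng.random() < fail_rate:
        raise ConnectionError(f"{name} unavailable")
    return f"{name}: ok"


async def scatter_gather(names, fail_rate, rng):
    """Fan out to every downstream, then split successes from failures."""
    calls = [call_downstream(name, fail_rate, rng) for name in names]
    results = await asyncio.gather(*calls, return_exceptions=True)
    ok = [r for r in results if not isinstance(r, Exception)]
    failed = [r for r in results if isinstance(r, Exception)]
    return ok, failed


def p_any_failure(per_call_fail_rate, fanout):
    """Chance that at least one of `fanout` independent calls fails."""
    return 1 - (1 - per_call_fail_rate) ** fanout
```

With 20 downstreams that each fail 1% of the time, `p_any_failure(0.01, 20)` is roughly 0.18: about one request in five sees at least one downstream failure, so partial results have to be a designed-for case.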

Page 32: Production Readiness Strategies in an Automated World

Stateful Complexity

Database-as-a-Service: PostgreSQL Edition

Page 33: Production Readiness Strategies in an Automated World

[Figure labels: SQL, WAL Files, Log Files]

PostgreSQL as a Service

Components:

• PostgreSQL

• Connection Pooler (pgbouncer)

• PITR Manager (WAL-E, omnipitr, pgBackRest)

• Logs Analyzer (pgbadger, pgfouine)

• Metrics Agent

• Failover Manager (Connections, State, Data Continuity/Self-Healing)

• Schema Versioning

Page 34: Production Readiness Strategies in an Automated World

[Figure labels: SQL, WAL Files, Log Files]

PostgreSQL as a Service

Dimensions of Work to measure:

• CPU

• RAM usage

• Network Usage

• TCP accept/connection rate

• Disk Capacity

• Maybe disk IO (read, write)

• Stability

• Request volume

• Request Latency

• Query performance

• Kernel Lock Contention

• Userland buffer eviction rate

• Cache-miss rate

• Size of blast radius

• ... etc.

Page 35: Production Readiness Strategies in an Automated World

[Figure labels: SQL, WAL Files, Log Files]

PostgreSQL as a Service

Complexity Factor:

~30 x (number of tables x metrics per table)
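As a worked instance of the estimate above (the table and metric counts are made-up inputs, and `postgres_complexity` is only an illustrative name):

```python
def postgres_complexity(tables, metrics_per_table, base=30):
    """Slide's rule of thumb: ~30 x (number of tables x metrics per table)."""
    return base * (tables * metrics_per_table)


# Hypothetical database: 50 tables, 10 metrics tracked per table.
example = postgres_complexity(tables=50, metrics_per_table=10)
```

For these inputs the factor is 15,000, against roughly 10 for the Echo service.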

Page 36: Production Readiness Strategies in an Automated World

[Figure labels: SQL, WAL Files, Log Files]

PostgreSQL as a Service

Database PSA Tangent:

• Don't confuse complexity with value.

• Databases are amazingly useful things because of their productivity and value as a network service.

• Databases assume the lion's share of the complexity burden: centralized complexity is easier than distributed complexity.

Page 37: Production Readiness Strategies in an Automated World

How do you systematically address inherent,

necessary complexity?

Page 38: Production Readiness Strategies in an Automated World

Checklists

• Identify Problems

• Read-Do Checklists

    • Ensure critical steps are hit

    • Useful in emergencies (plane on fire? Do X, Y, and Z...)

• Do-Confirm Checklists

    • Verify muscle memory

    • Combat atrophy and fatigue

Page 39: Production Readiness Strategies in an Automated World

Building a Modern Operations Checklist

Page 40: Production Readiness Strategies in an Automated World

Who uses checklists?

Astronauts

Surgeons

Pilots

Inspectors

Military

IT/Operations?

Page 41: Production Readiness Strategies in an Automated World

Good Checklists

• Have a clear purpose

• Are brief: 10-20 items, fit on a single page

• Focus on what's essential/mandatory

• Enumerate what must be done (and frequently forgotten)

• Don't replace personal judgement or skill

• Enforce discipline

• Provide tools for collaboration and communication

• Establish protocol or enforce a norm

Page 42: Production Readiness Strategies in an Automated World

Good Checklists

• Have a clear purpose

• Are brief: 10-20 items, fit on a single page

• Focus on what's essential/mandatory

• Enumerate what must be done (and frequently forgotten)

• Don't replace personal judgement

• Enforce discipline

• Provide tools for collaboration and communication

• Establish protocol or enforce a norm

Page 43: Production Readiness Strategies in an Automated World

Building a Modern Operations Checkli^WAudit

Page 44: Production Readiness Strategies in an Automated World

Production Ready

[Figure labels: SQL, WAL Files, Log Files]

Page 45: Production Readiness Strategies in an Automated World

Production Ready

[Figure labels: SQL, WAL Files, Log Files; Organizational Challenges, Technical Challenges]

Page 46: Production Readiness Strategies in an Automated World

Organizational Prerequisites

Standardized Jargon (e.g. SEV1 vs SEV2, client vs consumer)

Policy for Unique Service namespaces (app1 vs appN vs dbN)

# Deny registration access to services prefixed
# "app1-". Discovery of the service is still
# allowed in read mode.
service "app1-" {
  policy = "read"
}

service "app2-" {
  policy = "write"
}

Page 47: Production Readiness Strategies in an Automated World

Organizational Prerequisites

Standardized Jargon (e.g. SEV1 vs SEV2, client vs consumer)

Policy for Unique Service namespaces (app1 vs appN vs dbN)

Naming conventions established within a service (app1-api1 vs app1-dbN)

Rules of Engagement outlining how an outage is handled:

1. Identification

2. Response

3. Recovery

4. Prevention

5. Preparation

6. GOTO step #1

Page 48: Production Readiness Strategies in an Automated World

Organizational Prerequisites

Standardized Jargon (e.g. SEV1 vs SEV2, client vs consumer)

Policy for Unique Service namespaces (app1 vs appN vs dbN)

Naming conventions established within a service (app1-api1 vs app1-dbN)

Rules of Engagement outlining how an outage is handled

Centralized documentation

Establish a culture of systems thinking

Page 49: Production Readiness Strategies in an Automated World

Organizational Prerequisites

Establish a culture of systems thinking:

• a system is composed of parts

• a system is greater than the sum of its parts

• all the parts of a system must be related (directly or indirectly), else there are really two or more distinct systems

• a system is encapsulated (has a boundary)

• a system can be nested inside another system

• a system can overlap with another system

• a system consists of processes that transform inputs into outputs

• a system is autonomous in fulfilling its purpose: A car is not a system. A car with a driver is a system.

Page 50: Production Readiness Strategies in an Automated World

Organizational Prerequisites

Standardized Jargon (e.g. SEV1 vs SEV2, client vs consumer)

Policy for Unique Service namespaces (app1 vs appN vs dbN)

Naming conventions established within a service (app1-api1 vs app1-dbN)

Rules of Engagement outlining how an outage is handled

Centralized documentation

Establish a culture of Systems Thinking

Establish end-to-end ownership

Decouple service names from team names

Page 51: Production Readiness Strategies in an Automated World

Why do we care?

• We aren't always going to be working on our code.

• We need to establish a culture of maintenance and the necessary supporting systems, both organizational and technical.

Page 52: Production Readiness Strategies in an Automated World

Audit Reduced to a Checklist

High-level summary of the service?

Stateful or Stateless

List of important consumers

Release Process

On-Call Instructions / Incident Response

Health Defined

Customer Service Endpoint?

Backups

Geographic Redundancy

Page 53: Production Readiness Strategies in an Automated World

Audit back to Checklist

High-level summary of the service? => Organizational Concern

Stateful or Stateless => Technical Concern

List of important consumers => Tech and Org Concern

Release Process => Organizational Concern

On-Call Instructions / Incident Response => Organizational Concern

Health Defined => Technical Concern

Customer Service Endpoint? => Organizational Concern

Backups => Organizational Concern

Geographic Redundancy => Organizational Concern

Page 54: Production Readiness Strategies in an Automated World

Plan, Doc, Vet, and Decide Starting Here...

[Figure: life-cycle timeline. Axes: Time, Prod. Region: R&D.]

1) Idea!

2) Production Ready

Page 55: Production Readiness Strategies in an Automated World

... ideally before here...

[Figure: life-cycle timeline. Axes: Time, Production. Regions: R&D, "Production Supported".]

1) Idea!

Forced to fix code or docs.

N) End of Life

Page 56: Production Readiness Strategies in an Automated World

... but NO later than here!!!

[Figure: life-cycle timeline. Axes: Time, Production. Regions: R&D, "Production Supported".]

1) Idea!

Forced to fix code or docs.

N) End of Life

Page 57: Production Readiness Strategies in an Automated World

(It's good to refine here when this happens)

[Figure: life-cycle timeline. Axes: Time, Production. Regions: R&D, "Production Supported".]

1) Idea!

Forced to fix code or docs.

N) End of Life

Page 58: Production Readiness Strategies in an Automated World

Value from Checklists

High-level summary of the service? => Faster Training / Fungible Skills

Stateful or Stateless => Universal / Consistent / Standard

List of important consumers => Faster Understanding and Training

Release Process => Faster Resolution / Fungible Skills

On-Call Instructions / Incident Response => Larger Pool / Increased Sympathy

Health Defined => Standardized Resolution

Customer Service Endpoint? => One Source of Truth

Backups => Standard Procedures

Geographic Redundancy => Unplanned Disasters Mitigated

Page 59: Production Readiness Strategies in an Automated World

How do you build a checklist?

Page 60: Production Readiness Strategies in an Automated World

Summary: Vertical Places to Look

[Figure labels: SQL, WAL Files, Log Files; Organizational Challenges, Technical Challenges]

Page 61: Production Readiness Strategies in an Automated World

Summary: Horizontal Places to Look

[Figure: life-cycle timeline. Axes: Time, Prod. Region: R&D.]

1) Idea!

2) Production Ready

Page 62: Production Readiness Strategies in an Automated World

Questions?

Thank the audience for their time.

Name: Sean Chittenden

Twitter: @SeanChittenden

Page 63: Production Readiness Strategies in an Automated World

Recommended Reading

Page 64: Production Readiness Strategies in an Automated World

Seed Questions for Checklists

Page 65: Production Readiness Strategies in an Automated World

Service Checklist: Overview

Service Overview

• Description and relevance to the business

• Short explanation of how the service fits into the ecosystem of microservices

• Pointers to more detailed documentation

• Pointers to the current team owners

Stateful or Stateless service

Does the service employ any internal caching?

Dependency management: e.g. embedded libraries that have been vendored (not necessary with Go, this is self-evident)

Page 66: Production Readiness Strategies in an Automated World

Service Checklist: Overview

Service Overview

$ head my-service.job
# This declares a job named "service123". There can be exactly one
# job declaration per job file.
job "service123" {
  # Specify this job should run in the region named "us". Regions
  # are defined by the Nomad servers' configuration.
  region = "us"

  # Spread the tasks in this job between us-west-2 and us-east-1.
  datacenters = ["us-west-2", "us-east-1"]

  # Run this job as a "service" type. Each job type has different
  # properties. See the documentation below for more examples.
  type = "service"

Page 67: Production Readiness Strategies in an Automated World

Service Checklist: Overview

Service Overview

$ head my-docs.job
# This declares a job named "docs". There can be exactly one
# job declaration per job file.
job "docs" {
  meta {
    owner = "https://github.com/myorg/myproject/blob/master/owners.md"
    docs-url = "https://github.com/myorg/myproject"
    system-summary = "https://github.com/myorg/myproject/blob/master/system-summary.md"
  }

Page 68: Production Readiness Strategies in an Automated World

Service Checklist: Overview

Service Overview

• Auditable via the API: http://nomad.service.consul:4646/v1/job/<ID>
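One way such an audit could look, as a sketch: fetch the job from the Nomad HTTP API and report which documentation keys are absent from its `Meta` map. The required key names mirror the `meta` block in the earlier job-file example; treating them as mandatory is an assumption, as are the helper names.

```python
import json
import urllib.request

# Keys taken from the example job file; making them mandatory is an assumption.
REQUIRED_META = ("owner", "docs-url", "system-summary")


def missing_meta(job, required=REQUIRED_META):
    """Return the required meta keys that are absent or empty in a job."""
    meta = job.get("Meta") or {}
    return [key for key in required if not meta.get(key)]


def fetch_job(job_id, base_url="http://nomad.service.consul:4646"):
    """Read one job definition from the Nomad HTTP API."""
    url = f"{base_url}/v1/job/{job_id}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


# Offline example with a hand-written job document.
documented = {"ID": "docs", "Meta": {
    "owner": "https://github.com/myorg/myproject/blob/master/owners.md",
    "docs-url": "https://github.com/myorg/myproject",
    "system-summary": "https://github.com/myorg/myproject/blob/master/system-summary.md",
}}
undocumented = {"ID": "service123", "Meta": None}
```

Against a live cluster, `missing_meta(fetch_job("docs"))` returning an empty list would mean the job carries all three pointers.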

Page 69: Production Readiness Strategies in an Automated World

Service Checklist: Overview

List of high-level consumers

• API consumed by other services within the organization

• Public Internet

• Marketing (A/B testing?)

• Customer Service

Service Confidentiality Classification

Sales Information

• Unofficial docs that can be used by sales or marketing. Authoritative information comes from the team writing the service. Doesn't need to be final copy, but should include useful figures about this service.

Page 70: Production Readiness Strategies in an Automated World

Incident Response

Release Process

On-call - what's the fallback strategy for a small service with a team of two?

How is the service installed?

How is the service configured?

How is the service's process managed?

• How is it started?

• How is it stopped?

• Is there a graceful shutdown procedure vs a rapid shutdown procedure?

• Can you send a SIGKILL signal to the process?

Page 71: Production Readiness Strategies in an Automated World

Incident Response

Release Process

On-call - what's the fallback strategy for a small service with a team of two?

How is the service installed?

How is the service configured?

How is the service's process managed?

Is the process management platform-specific?

Is there a table mapping each signal to the effect of the signal?

Process Management

Is Process Management hooked into the monitoring and alerting framework?

Page 72: Production Readiness Strategies in an Automated World

Health

Health of the Service

What is the definition of healthy?

TIP: Use Consul Health Checks for Break/Fix

{
  "service": {
    "name": "redis",
    "tags": ["master"],
    "address": "127.0.0.1",
    "port": 8000,
    "enableTagOverride": false,
    "checks": [
      {
        "script": "/usr/local/bin/check_redis.py",
        "interval": "10s"
      }
    ]
  }
}
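The contents of `check_redis.py` are not shown in the deck. The sketch below is a hypothetical stand-in that only measures TCP connect time and maps it onto the exit codes Consul script checks use (0 passing, 1 warning, anything else critical). The thresholds are invented, and a real check would also speak the Redis protocol.

```python
import socket
import time

# Exit codes understood by Consul script checks.
PASSING, WARNING, CRITICAL = 0, 1, 2


def connect_latency(host, port, timeout=1.0):
    """Seconds taken to open a TCP connection, or None if it failed."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def classify(latency_s, warn_s=0.05, crit_s=0.5):
    """Map a connect latency onto a check exit code (thresholds assumed)."""
    if latency_s is None or latency_s >= crit_s:
        return CRITICAL
    if latency_s >= warn_s:
        return WARNING
    return PASSING

# As a script, the last line would be something like:
#   sys.exit(classify(connect_latency("127.0.0.1", 8000)))
```

Returning a warning state before going critical gives operators a gradient to act on instead of a binary up/down signal.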

Page 73: Production Readiness Strategies in an Automated World

Health

Health of the Service

What is the definition of healthy?

Is there any seasonality to the definition of healthy?

How do you observe the service?

Is there any automated capacity planning attached to the service?

Page 74: Production Readiness Strategies in an Automated World

Customer Service

How does customer service interact with this service?

Does CS have direct access to PII or other sensitive material?

Customer Service

Page 75: Production Readiness Strategies in an Automated World

Quality Metrics

What are the important KPIs coming out of this service?

• If you don't measure it, you won't optimize for it.

• If you don't measure it, you can't manage it.

• You can only succeed at what you can measure.

• You can't improve what you don't measure.

Page 76: Production Readiness Strategies in an Automated World

Quality Metrics

What are the important KPIs coming out of this service?

Measuring the number of round-trips between Support and Customers/Users

Measuring the number of round-trips between Support and Engineering

Measuring the "level of effort" or amount of input a person has to submit in order to receive support.

Accuracy of information provided by customers?

Measure the "rate of access" to PII information.

Page 77: Production Readiness Strategies in an Automated World

Quality Metrics

What are the important KPIs coming out of this service?

Strategy: Centralize and poll for number of tagged issues out of GitHub.
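A possible shape for that poller, using the GitHub REST search endpoint, whose response includes a `total_count` field. The repository and label names are placeholders, the label is assumed to contain no spaces, and unauthenticated calls are rate-limited, so a real poller would send a token.

```python
import json
import urllib.parse
import urllib.request


def search_url(repo, label, state="open"):
    """Build a GitHub issue-search URL for one repo and one label."""
    query = f"repo:{repo} label:{label} is:issue is:{state}"
    params = urllib.parse.urlencode({"q": query, "per_page": 1})
    return "https://api.github.com/search/issues?" + params


def count_tagged_issues(repo, label):
    """Return how many open issues in `repo` carry `label`."""
    request = urllib.request.Request(
        search_url(repo, label),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request, timeout=10) as resp:
        return json.load(resp)["total_count"]


# Placeholder names; substitute a real repository and label.
example_url = search_url("myorg/myproject", "sev1")
```

Polling `count_tagged_issues` on a schedule and writing the result to the metrics system turns issue labels into a trackable KPI.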

Page 78: Production Readiness Strategies in an Automated World

Organization Prerequisites

Define the gradients in an outage

• SEV1 - Hard outage, complete loss of service or "major impact to business value/revenue".

• SEV2 - Partial outage or impaired service (SLA violation).

• SEV3 - Integrity of service issue (bugs).

• SEV4 - Non-critical issue that needs to be prioritized 9-5 M-F.

• SEV5 - Janitorial work that needs to happen on a routine schedule.

Define what it means to follow through with an outage.

• What level of follow through is required?

• Postmortems?

• Who patches it and who receives time to actually fix it permanently?
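Encoding the gradients once, in code, keeps alerting rules and documentation from drifting apart. A small sketch: the descriptions are condensed from the list above, and the helper name is an illustrative choice.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Outage gradients; a lower number is more severe."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


DESCRIPTIONS = {
    Severity.SEV1: "Hard outage, complete loss of service",
    Severity.SEV2: "Partial outage or impaired service (SLA violation)",
    Severity.SEV3: "Integrity of service issue (bugs)",
    Severity.SEV4: "Non-critical issue, prioritized 9-5 M-F",
    Severity.SEV5: "Janitorial work on a routine schedule",
}


def waits_for_business_hours(sev):
    """SEV4 and SEV5 are scheduled work rather than interrupts."""
    return sev >= Severity.SEV4
```

How SEV1 through SEV3 escalate is left open here, since that is what the Outage Consequences grid is meant to define.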

Page 79: Production Readiness Strategies in an Automated World

Outage Consequences

Severity | Revenue Impact | User Impact | Systems Impact | Escalation

SEV1 | | | |

SEV2 | | | |

SEV3 | | | |

SEV4 | | | |

SEV5 | | | |

Page 80: Production Readiness Strategies in an Automated World

Outage Consequences

Define the gradients in an outage

Sketch out the direct and indirect consequences on the system

Page 81: Production Readiness Strategies in an Automated World

Tracing

Is there a tracing token sent by upstream? If not, why not?

Is this service at the boundary of HTTP and RPC?

Is there an API library available that will automatically inject the tracing token into downstream calls?

Can tracing only be used in aggregate or can it be used for individual problems?
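A sketch of the library behavior the third question describes: reuse the upstream token when one arrives, mint one at the edge when it does not, and attach it to every downstream call. The header name `X-Trace-Token` is a placeholder, not an established standard.

```python
import uuid

TRACE_HEADER = "X-Trace-Token"  # placeholder header name (assumption)


def ensure_trace_token(incoming_headers):
    """Return (token, was_generated) for an incoming request."""
    for name, value in incoming_headers.items():
        if name.lower() == TRACE_HEADER.lower() and value:
            return value, False
    # No token from upstream: this service is the edge, so mint one.
    return uuid.uuid4().hex, True


def downstream_headers(token, extra=None):
    """Headers for a downstream call, always carrying the trace token."""
    headers = dict(extra or {})
    headers[TRACE_HEADER] = token
    return headers
```

Counting how often `was_generated` is true answers the "If not, why not?" question with data: it shows which callers are not propagating tokens.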

Page 82: Production Readiness Strategies in an Automated World

Geographic Redundancy

Is the service geographically redundant or not? If not, why not?

If yes:

Does this happen automatically?

Page 83: Production Readiness Strategies in an Automated World

Geographic Redundancy

{
  "Name": "my-query",
  "Session": "adf4238a-882b-9ddc-4a9d-5b6758e4159e",
  "Token": "",
  "Near": "node1",
  "Service": {
    "Service": "redis",
    "Failover": {
      "NearestN": 3,
      "Datacenters": ["dc1", "dc2"]
    },
    "OnlyPassing": false,
    "Tags": ["master", "!experimental"]
  },
  "DNS": {
    "TTL": "10s"
  }
}

Page 84: Production Readiness Strategies in an Automated World

Geographic Redundancy

Is the service geographically redundant or not? If not, why not?

If yes:

Does this happen automatically?

What mechanisms handle this?

Are there any regulatory concerns that come into play?

Is the failover process manual?

Does this happen at human timescale or on a machine timescale?

Is the geographically redundant path continually tested?

Page 85: Production Readiness Strategies in an Automated World

Active-Active

Can this service be active-active?

If not, why not?

If yes, what kind of locking concerns or information sharing concerns need to be factored in?

Page 86: Production Readiness Strategies in an Automated World

Data Classification

Does the service come in contact with any sensitive data?

If yes:

What type of data? (PII, passwords, keys, financial information, credit cards, ACH, etc.)

What regulatory compliance regimes apply to this service? (Safe Harbor, PCI, SOX?)

Is the data stored, or just passed in transit?

Can any sensitive data end up in log files?

Can sensitive but necessary data use a proxy token instead?

Can this information leave the organization and go to a third party?
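One common way to keep a value usable for correlation without logging it is to substitute a keyed hash. The sketch below does that for email addresses in a log line. The regular expression is deliberately simple, the `tok_` prefix and key handling are assumptions, and a keyed hash is one-way: it is not a reversible token vault.

```python
import hashlib
import hmac
import re

# Deliberately simple email pattern; real PII detection needs more care.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def proxy_token(value, key):
    """Stable, non-reversible stand-in for a sensitive value."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:16]


def scrub(line, key):
    """Replace every email address in a log line with its proxy token."""
    return EMAIL.sub(lambda m: proxy_token(m.group(0), key), line)


KEY = b"example-only-key"  # in practice, load from a secrets manager
scrubbed = scrub("login ok user=alice@example.com ip=10.0.0.7", KEY)
```

Because the same input always yields the same token under one key, support staff can still follow a single user through the logs without seeing the address.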

Page 87: Production Readiness Strategies in an Automated World

SPOFs

What SPOFs exist, if any?

What's the timescale for this SPOF?

What's the timescale for transition from leader to follower or follower to leader?

If stateful, is "split brain" possible?

NOTE: State is a SPOF: failing over state takes time.

Page 88: Production Readiness Strategies in an Automated World

Escalation Path

What's the escalation path inside of the organization?

What's the escalation path outside of the organization? Open Source community or commercial support?

Is there semi-regular training on how to triage and escalate?

Is there a playbook for relevant low-level debugging tools available for use?

TIP: Use automatic escalations within PagerDuty or OpsGenie.

TIP: Use standardized service techniques to create fungible support resources.

Page 89: Production Readiness Strategies in an Automated World

Quantiles of Health

Can health be defined in terms of quantiles vs binary up/down?

What are the upper and lower bounds for healthy?

What system is authoritative for determining if something is healthy?

How can an external actor verify if the system is healthy? Is there a command-line tool or API?
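A sketch of a quantile-based health verdict: compute latency percentiles from recent samples and compare each one against a lower and an upper bound. The bound values are invented for illustration. The lower bound catches responses that are suspiciously fast, for example errors returned before any work is done.

```python
import statistics


def latency_quantiles(samples_ms):
    """Return p50, p95 and p99 of a list of latency samples."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def health(samples_ms, bounds_ms):
    """Compare each quantile with its (low, high) bound."""
    observed = latency_quantiles(samples_ms)
    breaches = {
        name: value
        for name, value in observed.items()
        if not bounds_ms[name][0] <= value <= bounds_ms[name][1]
    }
    status = "healthy" if not breaches else "degraded"
    return status, breaches


# Hypothetical bounds, in milliseconds.
BOUNDS = {"p50": (1, 60), "p95": (1, 150), "p99": (1, 200)}
```

Exposing `health` behind a command-line tool or an API endpoint would give external actors the verification path the last question asks about.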

Page 90: Production Readiness Strategies in an Automated World

Canary

Does the request have a "canary request mode"?

Can this be enabled per customer?

Is the canary mode used in monitoring to validate end-to-end functionality?

Page 91: Production Readiness Strategies in an Automated World

Downstream Services

How does this service respond upstream to failures in its downstream dependencies?

Is there a metric to indicate timed-out requests?

Is there a feature-flag that enables a circuit-breaker?

How are connectivity problems retried in the system?

Retry the same backend?

Retry a different backend?

Timeout?

Is there a deadline timer passed in?

Is a header added to indicate partial failure of downstream services?

Are response codes standardized?
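To make the retry and deadline questions concrete, here is a sketch of one policy: try each backend once, in order, and hand every attempt only the time left on a shared deadline. Backends are modeled as plain callables taking `(request, timeout_s)`, which is an assumption made for illustration.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the overall deadline passes before any success."""


def call_with_deadline(backends, request, deadline_s, clock=time.monotonic):
    """Try each backend once, in order, within one shared deadline."""
    expires = clock() + deadline_s
    errors = []
    for backend in backends:
        remaining = expires - clock()
        if remaining <= 0:
            raise DeadlineExceeded(f"gave up after {len(errors)} attempts")
        try:
            # Pass the remaining budget down so the callee can stop early.
            return backend(request, remaining)
        except ConnectionError as exc:
            errors.append(exc)
    raise ConnectionError(f"all {len(errors)} backends failed")


def flaky(request, timeout_s):
    raise ConnectionError("backend A refused the connection")


def good(request, timeout_s):
    return f"ok:{request}"
```

Retrying a different backend avoids hammering a host that is already struggling. The shared deadline stops retries from outliving the caller's patience.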

Page 92: Production Readiness Strategies in an Automated World

Architectural Limits

What are the expected limits of this system?

How often is "peak-load" defined?

Is there 3x capacity for the service in order to absorb reasonable burstiness?

Is the band of nominal resource usage defined?

• "At 10K RPS, network utilization should be between 200-300Mbps, using two cores at ~60% utilization, 50MB of RAM, and doing an average of 5-10 disk IOPs. All values are +/- 25%."
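The nominal band in that example can be written down as data and checked mechanically. In this sketch the metric names are invented, and applying the +/- 25% tolerance to both edges of each range is one possible reading of the slide.

```python
# Nominal (low, high) values at 10K RPS, taken from the example above.
NOMINAL_AT_10K_RPS = {
    "network_mbps": (200, 300),
    "cpu_core_util_pct": (60, 60),
    "ram_mb": (50, 50),
    "disk_iops": (5, 10),
}
TOLERANCE = 0.25  # "All values are +/- 25%"


def out_of_band(observed, nominal=NOMINAL_AT_10K_RPS, tolerance=TOLERANCE):
    """Return the metrics that fall outside their widened nominal band."""
    alerts = {}
    for name, (low, high) in nominal.items():
        floor = low * (1 - tolerance)
        ceiling = high * (1 + tolerance)
        value = observed[name]
        if not floor <= value <= ceiling:
            alerts[name] = (value, floor, ceiling)
    return alerts


normal = {"network_mbps": 250, "cpu_core_util_pct": 62, "ram_mb": 55, "disk_iops": 7}
leaking = dict(normal, ram_mb=80)
```

A reading below the band matters as much as one above it: unusually low network use at a steady request rate can mean responses are being truncated or failing.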

Page 93: Production Readiness Strategies in an Automated World

Logging

How is logging set up?

What gets logged?

What is the minimum log retention?

How often are logs rotated? By size or by fixed interval?

Are logs shipped off box?

Are they streamed without hitting disk?

Is there any sensitive data in the logs?

Page 94: Production Readiness Strategies in an Automated World

Load Shedding

How can you load-shed?

Are there any feature flags that enable circuit breakers that reduce expensive functionality?

Page 95: Production Readiness Strategies in an Automated World

Prepare For the Worst

Assume the service can't come back online, what's the impact?

Page 96: Production Readiness Strategies in an Automated World

Backup and Restore

Does this system have a reproducible build?

How often are backups taken?

How often are the restores executed?

What's the recovery point objective?

What's the mean time to recovery?

What's the definition of acceptable data loss in the event of failure?

Page 97: Production Readiness Strategies in an Automated World

Deployment

How is this service tested and deployed?

Is the deployment in prod any different than test?

How can you roll back?

Is the application part of a CI/CD pipeline?

How is production data scrubbed and used in staging/UAT in order to simulate production-like loads without using production data?