The Selfish Stack [FutureStack16 NYC]


The Selfish Stack
FutureStack – August 2016

Cristopher Stauffer
Director of Test Engineering

cstauffer@yodle.com

A Common Tale

Proof of Concept → Beta Release → Critical Production Application → Critical Business Liability

Founders → Early Startup → Well Funded → Expanding

▶ Unsustainable Development Costs

▶ Unnecessary Development Synchronization

▶ Inability to Scale

▶ Is All Things To All People

Production Liability

Engineering Ecosystem

Entitled Capabilities

▶ Sustainable

▶ Independent

▶ Scalable

▶ Clear Intention

Our System

▶ Microservices Architecture (2 Years Ago)
  ▶ HTTP Microservices
  ▶ Docker Containers
  ▶ Unit, Function Testing
  ▶ On Demand Deployment
  ▶ Infrastructure Monitoring

Our System

▶ Microservices Architecture (18 Months Ago)
  ▶ Promoting only microservice development
  ▶ Promoting breaking apart monoliths
  ▶ Deploying 100 releases a month

What were we seeing?

[Chart: Service Count, monthly, Nov-14 through May-16; y-axis 0–250]

What were we seeing?

[Chart: Environment Deployments, monthly, Nov-14 through May-16; y-axis 0–1,400]

Our System

▶ Microservices Architecture (18 Months Ago)
  ▶ Out of Memory
  ▶ Out of Disk Space
  ▶ Diagnosing Performance Degradations
  ▶ Misconfigurations
  ▶ Missed certification steps
  ▶ Bad merges

Our System

▶ Customer/User Dissatisfaction

▶ Loss of Engineering Confidence

Pain Points

▶ Manually Set Static Configuration

▶ Manual Monitoring

▶ Process Level Health Checks

▶ Manual CD Pipeline (even with automated tests)

▶ All or Nothing Deployments

Why Was It So Painful?

Won’t Scale

Think About a Selfless Stack

“My new service is going to use up all the memory on the host, but it needs it”

“If you say so!”

Think About a Selfless Stack

“Not all the tests passed, but getting this out is really important”

“You know best!”

Think About a Selfless Stack

“The best way for me to monitor my new app is this new metrics tool nobody has used”

“I’m sure there’s a good reason…”

Think About a Selfless Stack

[Diagram: a fleet of services, each one reporting “A-OK!”]

Think About a Selfless Stack

[Diagram: the same fleet, each service now shouting “I’M NOT OK!”]

What did we need?

▶ Correct Configuration and Routing

▶ Error Detection and Resolution

▶ Utilization and Optimization of Resources

▶ Protecting System Integrity

Selfish Stack

I am Selfish
I CARE ABOUT YOU
JUST NOT AS MUCH AS I CARE ABOUT MYSELF

Error Detection and Alerting

▶ New Relic Monitoring For Microservices and Legacy Apps

▶ Simple – just add an agent (see the sketch below)

▶ Detailed per application dashboards out of the box

▶ Single score to focus attention
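To make “just add an agent” concrete, here is a minimal sketch using New Relic’s Python agent as a stand-in; per the next slide, Yodle actually baked the JVM agent into a base Docker image. The newrelic.ini file and task names are assumptions, not anything from the talk.

```python
# Minimal sketch: attach the New Relic Python agent to an app.
# Assumes a newrelic.ini holding the license key and application name.
import newrelic.agent

newrelic.agent.initialize("newrelic.ini")               # read key + app name
application = newrelic.agent.register_application(timeout=10.0)

# Work decorated like this shows up on the out-of-the-box per-application
# dashboards with no further instrumentation effort.
@newrelic.agent.background_task(name="nightly_report", group="Task")
def nightly_report():
    pass  # business logic goes here

if __name__ == "__main__":
    nightly_report()
```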


Base Docker Image

▶ Docker Engine
  ▶ Ubuntu Image
  ▶ JVM Image (e.g. Java 8)
  ▶ New Relic Agent
  ▶ Microservice Image
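A hypothetical sketch of that layering, driven by the Docker SDK for Python (pip install docker). The image names, jar paths, and package choices are illustrative, not Yodle’s actual base images; it assumes ./newrelic/newrelic.jar and ./build/service.jar exist in the build context.

```python
import pathlib
import docker

# One line per layer in the list above: Ubuntu -> JVM -> New Relic agent ->
# the microservice itself, so every service inherits monitoring for free.
DOCKERFILE = """\
FROM ubuntu:16.04
RUN apt-get update && apt-get install -y openjdk-8-jre-headless
COPY newrelic/newrelic.jar /opt/newrelic/newrelic.jar
COPY build/service.jar /opt/service/service.jar
CMD ["java", "-javaagent:/opt/newrelic/newrelic.jar", "-jar", "/opt/service/service.jar"]
"""

pathlib.Path("Dockerfile").write_text(DOCKERFILE)
client = docker.from_env()
image, _ = client.images.build(path=".", tag="example/microservice:latest")
print(image.id)
```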


100 Apps in 100 Days

▶ Made use of our base containers

▶ Rolled out monitoring to every application in the fleet

▶ Suddenly we had visibility everywhere.

▶ Alerting was based on a team ownership model


Is this a Selfish System?

▶ Pool of Docker containers glued together

▶ Engineers are alerted

▶ Engineers make changes

▶ Engineers make the call


Not Selfish

Configuration and Application Orchestration

▶ Configuration

▶ Provisioning

▶ Routing

▶ Resource Balancing

What Engineers Should Be Focusing On

▶ Delivering customer value

▶ Satisfying internal needs

▶ Improving system resiliency

▶ Increasing Engineering productivity


Back to Our Pain Points

▶ Manually Set Static Configuration

▶ Manual Monitoring

▶ Process Level Health Checks

▶ Manual CD Pipeline (even with automated tests)

▶ All or Nothing Deployments

Platform as a Service

▶ Utilizes Mesos/Marathon Technology

▶ Highly Available

▶ Container Orchestration Platform

▶ Capable of intelligently limiting resources and balancing load

▶ Discovery and Routing Aware

▶ Capable of detecting ‘unhealthy’ applications

Basic Workflow

▶ Deploy applications to PaaS

▶ PaaS decides what host and port to run applications on

▶ PaaS determines if resources are available

▶ Health checks are built in to ensure application uptime
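A minimal sketch of that workflow against Marathon’s v2 REST API (the Mesos/Marathon platform named above). The Marathon URL, app id, image name, and resource numbers are all illustrative.

```python
import requests

MARATHON = "http://marathon.example.com:8080"  # hypothetical endpoint

app = {
    "id": "/example/orders-service",
    "cpus": 0.5,        # the PaaS enforces these limits...
    "mem": 512,         # ...rather than trusting the service's own judgment
    "instances": 3,     # the instance count it will try to maintain
    "container": {
        "type": "DOCKER",
        "docker": {
            "image": "example/microservice:latest",
            "network": "BRIDGE",
            # The service declares only its container port; Marathon decides
            # which host and host port each instance actually runs on.
            "portMappings": [{"containerPort": 8080}],
        },
    },
}

resp = requests.post(f"{MARATHON}/v2/apps", json=app, timeout=10)
resp.raise_for_status()
print("deployment accepted:", resp.status_code)
```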


Health Checks

▶ Complements New Relic as startup validation

▶ Addresses risks of nodes hard crashing and not recovering

▶ Addresses risk of non-reporting New Relic hosts due to OOM

▶ Attempts to always maintain recommended instance count
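Expressed as a Marathon healthChecks stanza, the built-in checking above might look like this. The field names are real Marathon v2 fields; the values and the /health path are assumptions about a typical service.

```python
app = {"id": "/example/orders-service"}  # abbreviated; see the earlier sketch

app["healthChecks"] = [
    {
        "protocol": "HTTP",
        "path": "/health",            # the service must answer 2xx here
        "portIndex": 0,               # first port Marathon assigned
        "gracePeriodSeconds": 60,     # startup window, complementing New Relic
        "intervalSeconds": 10,
        "timeoutSeconds": 5,
        # After three straight failures Marathon kills and reschedules the
        # task, which also covers hard-crashed or OOM'd instances that simply
        # stop reporting to New Relic.
        "maxConsecutiveFailures": 3,
    }
]
```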


Resource Utilization

▶ Re-balancing of applications across fleet of nodes

▶ Safeguards for CPU and memory starvation

▶ Ability to scale on demand (still human driven; sketch below)

▶ Protective of over allocation
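A sketch of “scale on demand (still human driven)”: a person picks the instance count, the platform does the placement. PUT /v2/apps/{id} is Marathon’s real scaling call; the URL and app id are illustrative.

```python
import requests

MARATHON = "http://marathon.example.com:8080"

def scale(app_id: str, instances: int) -> None:
    # Marathon spreads the new tasks across the node fleet; if the cluster
    # lacks the requested cpus/mem, tasks wait rather than overcommit a host.
    resp = requests.put(f"{MARATHON}/v2/apps{app_id}",
                        json={"instances": instances}, timeout=10)
    resp.raise_for_status()

scale("/example/orders-service", 6)  # human decision, automated execution
```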


Back to Our Pain Points

▶ Manually Set Static Configuration

▶ Manual Monitoring

▶ Process Level Health Checks

▶ Manual CD Pipeline (even with automated tests)

▶ All or Nothing Deployments

Little Selfish

Won’t Scale

What did we need?

▶ Correct Configuration and Routing

▶ Error Detection and Resolution

▶ Utilization and Optimization of Resources

▶ Protecting System Integrity

Safe Continuous Delivery

Continuous Integration

Regressions give comfort

▶ Monolithic releases are understandable

▶ We tested everything

▶ Everything works


Continuous Delivery Pipeline

Release code as it is written

Develop → Commit to Branch → Continuous Integration → Merge → Continuous Delivery

Regressions Are Resource Intensive

▶ Empower continuous delivery

▶ Focused – Highly Selective Integration Testing


Enter the Canary

▶ Landscape is in flux

▶ If we test only a subset of things, how can we be sure everything works?

▶ Canary ensures
  ▶ Dependencies are met
  ▶ Existing contracts are satisfied
  ▶ Production load can be handled


Canary Pipeline

▶ Special canary routing in our service discovery layer

▶ Test anywhere in the service mesh

▶ Discoverable tests using a /tests endpoint (sketch below)

▶ Monitor canary health in New Relic
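A minimal sketch of what a discoverable /tests endpoint could look like; the talk only says such an endpoint exists, so the response shape and test names here are hypothetical. The pipeline can GET /tests from any service in the mesh, then run each listed test against the canary.

```python
from flask import Flask, jsonify

app = Flask(__name__)

CANARY_TESTS = {
    # name -> callable returning True on pass; the bodies are stand-ins
    "dependencies_reachable": lambda: True,
    "orders_contract_v1": lambda: True,
}

@app.route("/tests")
def list_tests():
    return jsonify(sorted(CANARY_TESTS))  # discoverable by the pipeline

@app.route("/tests/<name>", methods=["POST"])
def run_test(name):
    test = CANARY_TESTS.get(name)
    if test is None:
        return jsonify({"error": "unknown test"}), 404
    return jsonify({"name": name, "passed": bool(test())})

if __name__ == "__main__":
    app.run(port=8080)
```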

Canary Isolated

▶ Receives no production traffic

▶ Reports to New Relic using unique name

▶ Discoverable and routable by Canary Tests

▶ Monitored for a configurable amount of time

▶ Triggers rollback if Canary Tests fail or New Relic reports Yellow/Red
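Putting the isolated-stage rules together, a sketch of the gate might read as follows. The two helpers are stubs for the signals the real pipeline pulls; none of this is Yodle’s actual code.

```python
import time

WATCH_SECONDS = 600   # "monitored for a configurable amount of time"
POLL_SECONDS = 30

def canary_tests_pass(canary_url: str) -> bool:
    return True       # stub: GET /tests, then POST each test to the canary

def new_relic_color(app_name: str) -> str:
    return "green"    # stub: health score under the canary's unique name

def isolated_canary_ok(canary_url: str, canary_app_name: str) -> bool:
    if not canary_tests_pass(canary_url):
        return False                                  # trigger rollback
    deadline = time.time() + WATCH_SECONDS
    while time.time() < deadline:
        if new_relic_color(canary_app_name) in ("yellow", "red"):
            return False                              # trigger rollback
        time.sleep(POLL_SECONDS)
    return True                                       # promote to partial traffic
```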

Canary Partial

▶ Receives % of production traffic

▶ Reports to New Relic using unique name

▶ Monitored for a configurable amount of time

▶ Ensures similar response times and return codes

▶ Ensures similar CPU / memory utilization

▶ Triggers rollback if New Relic reports Yellow/Red
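A sketch of the “similar response times / similar utilization” comparison at the partial stage; the 25% tolerance and the metric names are assumptions, not numbers from the talk.

```python
def similar(canary: float, baseline: float, tolerance: float = 1.25) -> bool:
    # e.g. the canary's p95 latency may be at most 25% worse than baseline
    return canary <= baseline * tolerance

def partial_canary_ok(canary: dict, baseline: dict) -> bool:
    return (
        similar(canary["p95_ms"], baseline["p95_ms"])              # response times
        and similar(canary["error_rate"], baseline["error_rate"])  # return codes
        and similar(canary["cpu"], baseline["cpu"])                # CPU utilization
        and similar(canary["mem_mb"], baseline["mem_mb"])          # memory utilization
    )

# Made-up numbers: the canary is slightly slower, but within tolerance.
print(partial_canary_ok(
    {"p95_ms": 110, "error_rate": 0.002, "cpu": 0.42, "mem_mb": 500},
    {"p95_ms": 100, "error_rate": 0.002, "cpu": 0.40, "mem_mb": 490},
))  # -> True
```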

Continuous Delivery Pipeline

▶ Receives general production traffic

▶ Reports to New Relic under unique name

▶ Monitored for a configurable amount of time

▶ Ensures similar response times and return codes

▶ Ensures similar CPU / Memory utilization

▶ Triggers rollback if New Relic reports Yellow/Red

The Actors

▶ No Human involvement

▶ Bamboo Build Agent

▶ Launch Pad (Custom Microservice) orchestrates Canary Process

▶ Cerebro (Custom Microservice) retrieves sensory information:

  ▶ Service Discovery
  ▶ Health Check
  ▶ New Relic
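A sketch of how these actors could fit together: Bamboo kicks off the build, Launch Pad walks the canary stages, and Cerebro turns its three sensors into a single verdict. The names Launch Pad and Cerebro come from the talk, but every signature below is invented.

```python
class Cerebro:
    """Aggregates sensory information into one healthy/unhealthy answer."""

    def healthy(self, app_name: str) -> bool:
        return (
            self.service_discovery_sees(app_name)    # registered and routable
            and self.health_check_passing(app_name)  # PaaS-level liveness
            and self.new_relic_green(app_name)       # performance score
        )

    def service_discovery_sees(self, app_name: str) -> bool: return True  # stub
    def health_check_passing(self, app_name: str) -> bool: return True    # stub
    def new_relic_green(self, app_name: str) -> bool: return True         # stub

def deploy_stage(app_name: str, stage: str) -> None:
    pass  # stub: Marathon deploy with the canary routing for that stage

def rollback(app_name: str) -> None:
    pass  # stub: redeploy the previous version

def launch_pad(app_name: str, cerebro: Cerebro) -> str:
    # No human involvement: every stage promotes itself or rolls back.
    for stage in ("isolated", "partial", "live"):
        deploy_stage(app_name, stage)
        if not cerebro.healthy(f"{app_name}-canary-{stage}"):
            rollback(app_name)
            return f"rolled back at {stage} stage"
    return "promoted"

print(launch_pad("orders-service", Cerebro()))
```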

Is This Selfish?

Think About the System

“My new service is going to use up all the memory on the host, but it needs it”

“Yeah…I don’t havethat to spare.”

Think About the System

“Not all the tests passed, but getting this out is really important”

“No tests passing; no way I’m deploying”

Think About the System

“The best way for me to monitor my new app is this new metrics tool nobody has used”

“I only speak to my friend New Relic, and he said your app just slowed down by 5x…”

Think About the System

“I feel really good about this”

“Me too…”

“Bad news, kid… I ran out of worker nodes…”

The Result

▶ Environmental Consistency

▶ Process That Is Appealing

▶ Early Detection and Response

▶ Instant Intervention and Rollback

Back to Our Pain Points

▶ Manually Set Static Configuration

▶ Manual Monitoring

▶ Process Level Health Checks

▶ Manual CD Pipeline (even with automated tests)

▶ All or Nothing Deployments

Selfish

Does Scale

What do we see now?

[Chart: Yodle Service Count, monthly, Nov-14 through May-16; y-axis 0–250]

What do we see now?

[Chart: Monthly Deployments, Nov-14 through May-16; y-axis 0–1,400]

Entitled Capabilities

▶ Sustainable

▶ Independent

▶ Scalable

▶ Clear Intention

Extending This Idea

▶ Autonomic Systems informed our direction

▶ Automatic decisions are made based on basic health stats and New Relic data

▶ Imagine additional sensors
  ▶ Database Load
  ▶ Service Mesh Health
  ▶ Custom Metrics
  ▶ Browser Product Data

▶ Give your CI/CD processes insights (sketch below)
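A sketch of what those pluggable sensors could look like: database load, mesh health, custom metrics, or browser data vetoing a deploy the same way New Relic does today. The structure is entirely hypothetical.

```python
from typing import Callable, List

Sensor = Callable[[str], bool]  # app name -> "does this look healthy?"

def database_load_ok(app: str) -> bool: return True    # stub sensor
def service_mesh_ok(app: str) -> bool: return True     # stub sensor
def browser_timings_ok(app: str) -> bool: return True  # stub sensor

SENSORS: List[Sensor] = [database_load_ok, service_mesh_ok, browser_timings_ok]

def release_gate(app: str, sensors: List[Sensor] = SENSORS) -> bool:
    # The CI/CD pipeline proceeds only if every sensor agrees; adding a new
    # insight is just appending a function, with no pipeline changes.
    return all(sensor(app) for sensor in sensors)

print(release_gate("orders-service"))
```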

I am Selfish
I CARE ABOUT YOU
BY CARING ABOUT MYSELF