The Selfish Stack [FutureStack16 NYC]


The Selfish Stack
FutureStack – August 2016

Cristopher Stauffer
Director of Test Engineering

cstauffer@yodle.com

A Common Tale

Proof of Concept → Beta Release → Critical Production Application → Critical Business Liability

Founders → Early Startup → Well Funded → Expanding

▶ Unsustainable Development Costs

▶ Unnecessary Development Synchronization

▶ Inability to Scale

▶ Is All Things To All People

Production Liability

Engineering Ecosystem

Entitled Capabilities

▶ Sustainable

▶ Independent

▶ Scalable

▶ Clear Intention

Our System

▶ Microservices Architecture (2 Years Ago)
  ▶ HTTP Microservices
  ▶ Docker Containers
  ▶ Unit, Function Testing
  ▶ On Demand Deployment
  ▶ Infrastructure Monitoring

Our System

▶ Microservices Architecture (18 Months Ago)
  ▶ Promoting only microservice development
  ▶ Promoting breaking apart monoliths
  ▶ Deploying 100 releases a month

What were we seeing?

[Chart: Service Count, monthly, Nov-14 through May-16; y-axis 0–250]

What were we seeing?

[Chart: Environment Deployments, monthly, Nov-14 through May-16; y-axis 0–1,400]

Our System

▶ Microservices Architecture (18 Months Ago)
  ▶ Out of Memory
  ▶ Out of Disk Space
  ▶ Diagnosing Performance Degradations
  ▶ Misconfigurations
  ▶ Missed certification steps
  ▶ Bad merges

Our System

▶ Customer/User Dissatisfaction

▶ Loss of Engineering Confidence

Pain Points

▶ Manually Set Static Configuration

▶ Manual Monitoring

▶ Process Level Health Checks

▶ Manual CD Pipeline (even with automated tests)

▶ All or Nothing Deployments

Why Was It So Painful?

Won’t Scale

Think About a Selfless Stack

“My new service is going to use up all the memory on the host, but it needs it”

“If you say so!”

Think About a Selfless Stack

“Not all the tests passed, but getting this out is really important”

“You know best!”

Think About a Selfless Stack

“The best way for me to monitor my new app is this new metrics tool nobody has used”

“I’m sure there’s a good reason…”

Think About a Selfless Stack

[Diagram: a fleet of services, each one reporting “A-OK!”]

Think About a Selfless Stack

[Diagram: the same fleet, each service now shouting “I’M NOT OK!”]

What did we need?

▶ Correct Configuration and Routing

▶ Error Detection and Resolution

▶ Utilization and Optimization of Resources

▶ Protecting System Integrity

Selfish Stack

I am Selfish
I CARE ABOUT YOU
JUST NOT AS MUCH AS I CARE ABOUT MYSELF

Error Detection and Alerting

▶ New Relic Monitoring For Microservices and Legacy Apps

▶ Simple – just add an agent (see the sketch below)

▶ Detailed per application dashboards out of the box

▶ Single score to focus attention
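To make “just add an agent” concrete, here is a minimal sketch using New Relic’s Python agent as a stand-in; per the next slide, Yodle actually baked the JVM agent into a base Docker image. The newrelic.ini file and task names are assumptions, not anything from the talk.

```python
# Minimal sketch: attach the New Relic Python agent to an app.
# Assumes a newrelic.ini holding the license key and application name.
import newrelic.agent

newrelic.agent.initialize("newrelic.ini")               # read key + app name
application = newrelic.agent.register_application(timeout=10.0)

# Work decorated like this shows up on the out-of-the-box per-application
# dashboards with no further instrumentation effort.
@newrelic.agent.background_task(name="nightly_report", group="Task")
def nightly_report():
    pass  # business logic goes here

if __name__ == "__main__":
    nightly_report()
```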


Base Docker Image

▶ Docker Engine
  ▶ Ubuntu Image
  ▶ JVM Image (e.g. Java 8)
  ▶ New Relic Agent
  ▶ Microservice Image
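A hypothetical sketch of that layering, driven by the Docker SDK for Python (pip install docker). The image names, jar paths, and package choices are illustrative, not Yodle’s actual base images; it assumes ./newrelic/newrelic.jar and ./build/service.jar exist in the build context.

```python
import pathlib
import docker

# One line per layer in the list above: Ubuntu -> JVM -> New Relic agent ->
# the microservice itself, so every service inherits monitoring for free.
DOCKERFILE = """\
FROM ubuntu:16.04
RUN apt-get update && apt-get install -y openjdk-8-jre-headless
COPY newrelic/newrelic.jar /opt/newrelic/newrelic.jar
COPY build/service.jar /opt/service/service.jar
CMD ["java", "-javaagent:/opt/newrelic/newrelic.jar", "-jar", "/opt/service/service.jar"]
"""

pathlib.Path("Dockerfile").write_text(DOCKERFILE)
client = docker.from_env()
image, _ = client.images.build(path=".", tag="example/microservice:latest")
print(image.id)
```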


100 Apps in 100 Days

▶ Made use of our base containers

▶ Rolled out monitoring to every application in the fleet

▶ Suddenly we had visibility everywhere.

▶ Alerting was based on a team ownership model


Is this a Selfish System?

▶ Pool of Docker containers glued together

▶ Engineers are alerted

▶ Engineers make changes

▶ Engineers make the call


Not Selfish

Configuration and Application Orchestration

▶ Configuration

▶ Provisioning

▶ Routing

▶ Resource Balancing

What Engineers Should Be Focusing On

▶ Delivering customer value

▶ Satisfying internal needs

▶ Improving system resiliency

▶ Increasing Engineering productivity


Back to Our Pain Points

▶ Manually Set Static Configuration

▶ Manual Monitoring

▶ Process Level Health Checks

▶ Manual CD Pipeline (even with automated tests)

▶ All or Nothing Deployments

Platform as a Service

▶ Utilizes Mesos/Marathon Technology

▶ Highly Available

▶ Container Orchestration Platform

▶ Capable of intelligently limiting resources and balancing load

▶ Discovery and Routing Aware

▶ Capable of detecting ‘unhealthy’ applications

Basic Workflow

▶ Deploy applications to PaaS

▶ PaaS decides what host and port to run applications on

▶ PaaS determines if resources are available

▶ Health checks are built in to ensure application uptime
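A minimal sketch of that workflow against Marathon’s v2 REST API (the Mesos/Marathon platform named above). The Marathon URL, app id, image name, and resource numbers are all illustrative.

```python
import requests

MARATHON = "http://marathon.example.com:8080"  # hypothetical endpoint

app = {
    "id": "/example/orders-service",
    "cpus": 0.5,        # the PaaS enforces these limits...
    "mem": 512,         # ...rather than trusting the service's own judgment
    "instances": 3,     # the instance count it will try to maintain
    "container": {
        "type": "DOCKER",
        "docker": {
            "image": "example/microservice:latest",
            "network": "BRIDGE",
            # The service declares only its container port; Marathon decides
            # which host and host port each instance actually runs on.
            "portMappings": [{"containerPort": 8080}],
        },
    },
}

resp = requests.post(f"{MARATHON}/v2/apps", json=app, timeout=10)
resp.raise_for_status()
print("deployment accepted:", resp.status_code)
```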


Health Checks

▶ Complements New Relic as startup validation

▶ Addresses risks of nodes hard crashing and not recovering

▶ Addresses risk of non-reporting New Relic hosts due to OOM

▶ Attempts to always maintain recommended instance count
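Expressed as a Marathon healthChecks stanza, the built-in checking above might look like this. The field names are real Marathon v2 fields; the values and the /health path are assumptions about a typical service.

```python
app = {"id": "/example/orders-service"}  # abbreviated; see the earlier sketch

app["healthChecks"] = [
    {
        "protocol": "HTTP",
        "path": "/health",            # the service must answer 2xx here
        "portIndex": 0,               # first port Marathon assigned
        "gracePeriodSeconds": 60,     # startup window, complementing New Relic
        "intervalSeconds": 10,
        "timeoutSeconds": 5,
        # After three straight failures Marathon kills and reschedules the
        # task, which also covers hard-crashed or OOM'd instances that simply
        # stop reporting to New Relic.
        "maxConsecutiveFailures": 3,
    }
]
```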


Resource Utilization

▶ Re-balancing of applications across fleet of nodes

▶ Safeguards for CPU and memory starvation

▶ Ability to scale on demand (still human driven; sketch below)

▶ Protective of over allocation
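A sketch of “scale on demand (still human driven)”: a person picks the instance count, the platform does the placement. PUT /v2/apps/{id} is Marathon’s real scaling call; the URL and app id are illustrative.

```python
import requests

MARATHON = "http://marathon.example.com:8080"

def scale(app_id: str, instances: int) -> None:
    # Marathon spreads the new tasks across the node fleet; if the cluster
    # lacks the requested cpus/mem, tasks wait rather than overcommit a host.
    resp = requests.put(f"{MARATHON}/v2/apps{app_id}",
                        json={"instances": instances}, timeout=10)
    resp.raise_for_status()

scale("/example/orders-service", 6)  # human decision, automated execution
```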


Back to Our Pain Points

▶ Manually Set Static Configuration

▶ Manual Monitoring

▶ Process Level Health Checks

▶ Manual CD Pipeline (even with automated tests)

▶ All or Nothing Deployments

Little Selfish

Won’t Scale

What did we need?

▶ Correct Configuration and Routing

▶ Error Detection and Resolution

▶ Utilization and Optimization of Resources

▶ Protecting System Integrity

Safe Continuous Delivery

Continuous Integration

Regressions give comfort

▶ Monolithic releases are understandable

▶ We tested everything

▶ Everything works


Continuous Delivery Pipeline

Release code as it is written

Develop → Commit to Branch → Continuous Integration → Merge → Continuous Delivery

Regressions Are Resource Intensive

▶ Empower continuous delivery

▶ Focused – Highly Selective Integration Testing


Enter the Canary

▶ Landscape is in flux

▶ If we test only a subset of things, how can we be sure everything works?

▶ Canary ensures
  ▶ Dependencies are met
  ▶ Existing contracts are satisfied
  ▶ Production load can be handled


Canary Pipeline

▶ Special canary routing in our service discovery layer

▶ Test anywhere in the service mesh

▶ Discoverable tests using a /tests endpoint (sketch below)

▶ Monitor canary health in New Relic
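A minimal sketch of what a discoverable /tests endpoint could look like; the talk only says such an endpoint exists, so the response shape and test names here are hypothetical. The pipeline can GET /tests from any service in the mesh, then run each listed test against the canary.

```python
from flask import Flask, jsonify

app = Flask(__name__)

CANARY_TESTS = {
    # name -> callable returning True on pass; the bodies are stand-ins
    "dependencies_reachable": lambda: True,
    "orders_contract_v1": lambda: True,
}

@app.route("/tests")
def list_tests():
    return jsonify(sorted(CANARY_TESTS))  # discoverable by the pipeline

@app.route("/tests/<name>", methods=["POST"])
def run_test(name):
    test = CANARY_TESTS.get(name)
    if test is None:
        return jsonify({"error": "unknown test"}), 404
    return jsonify({"name": name, "passed": bool(test())})

if __name__ == "__main__":
    app.run(port=8080)
```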

Canary Isolated

▶ Receives no production traffic

▶ Reports to New Relic using unique name

▶ Discoverable and routable by Canary Tests

▶ Monitored for a configurable amount of time

▶ Triggers rollback if Canary Tests fail or New Relic reports Yellow/Red
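Putting the isolated-stage rules together, a sketch of the gate might read as follows. The two helpers are stubs for the signals the real pipeline pulls; none of this is Yodle’s actual code.

```python
import time

WATCH_SECONDS = 600   # "monitored for a configurable amount of time"
POLL_SECONDS = 30

def canary_tests_pass(canary_url: str) -> bool:
    return True       # stub: GET /tests, then POST each test to the canary

def new_relic_color(app_name: str) -> str:
    return "green"    # stub: health score under the canary's unique name

def isolated_canary_ok(canary_url: str, canary_app_name: str) -> bool:
    if not canary_tests_pass(canary_url):
        return False                                  # trigger rollback
    deadline = time.time() + WATCH_SECONDS
    while time.time() < deadline:
        if new_relic_color(canary_app_name) in ("yellow", "red"):
            return False                              # trigger rollback
        time.sleep(POLL_SECONDS)
    return True                                       # promote to partial traffic
```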

Canary Partial

▶ Receives % of production traffic

▶ Reports to New Relic using unique name

▶ Monitored for a configurable amount of time

▶ Ensures similar response times and return codes

▶ Ensures similar CPU / memory utilization

▶ Triggers rollback if New Relic reports Yellow/Red
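A sketch of the “similar response times / similar utilization” comparison at the partial stage; the 25% tolerance and the metric names are assumptions, not numbers from the talk.

```python
def similar(canary: float, baseline: float, tolerance: float = 1.25) -> bool:
    # e.g. the canary's p95 latency may be at most 25% worse than baseline
    return canary <= baseline * tolerance

def partial_canary_ok(canary: dict, baseline: dict) -> bool:
    return (
        similar(canary["p95_ms"], baseline["p95_ms"])              # response times
        and similar(canary["error_rate"], baseline["error_rate"])  # return codes
        and similar(canary["cpu"], baseline["cpu"])                # CPU utilization
        and similar(canary["mem_mb"], baseline["mem_mb"])          # memory utilization
    )

# Made-up numbers: the canary is slightly slower, but within tolerance.
print(partial_canary_ok(
    {"p95_ms": 110, "error_rate": 0.002, "cpu": 0.42, "mem_mb": 500},
    {"p95_ms": 100, "error_rate": 0.002, "cpu": 0.40, "mem_mb": 490},
))  # -> True
```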

Continuous Delivery Pipeline

▶ Receives general production traffic

▶ Reports to New Relic under unique name

▶ Monitored for a configurable amount of time

▶ Ensures similar response times and return codes

▶ Ensures similar CPU / Memory utilization

▶ Triggers rollback if New Relic reports Yellow/Red

The Actors

▶ No Human involvement

▶ Bamboo Build Agent

▶ Launch Pad (Custom Microservice) orchestrates Canary Process

▶ Cerebro (Custom Microservice) retrieves sensory information:

  ▶ Service Discovery
  ▶ Health Check
  ▶ New Relic
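A sketch of how these actors could fit together: Bamboo kicks off the build, Launch Pad walks the canary stages, and Cerebro turns its three sensors into a single verdict. The names Launch Pad and Cerebro come from the talk, but every signature below is invented.

```python
class Cerebro:
    """Aggregates sensory information into one healthy/unhealthy answer."""

    def healthy(self, app_name: str) -> bool:
        return (
            self.service_discovery_sees(app_name)    # registered and routable
            and self.health_check_passing(app_name)  # PaaS-level liveness
            and self.new_relic_green(app_name)       # performance score
        )

    def service_discovery_sees(self, app_name: str) -> bool: return True  # stub
    def health_check_passing(self, app_name: str) -> bool: return True    # stub
    def new_relic_green(self, app_name: str) -> bool: return True         # stub

def deploy_stage(app_name: str, stage: str) -> None:
    pass  # stub: Marathon deploy with the canary routing for that stage

def rollback(app_name: str) -> None:
    pass  # stub: redeploy the previous version

def launch_pad(app_name: str, cerebro: Cerebro) -> str:
    # No human involvement: every stage promotes itself or rolls back.
    for stage in ("isolated", "partial", "live"):
        deploy_stage(app_name, stage)
        if not cerebro.healthy(f"{app_name}-canary-{stage}"):
            rollback(app_name)
            return f"rolled back at {stage} stage"
    return "promoted"

print(launch_pad("orders-service", Cerebro()))
```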

Is This Selfish?

Think About the System

“My new service is going to use up all the memory on the host, but it needs it”

“Yeah…I don’t havethat to spare.”

Think About the System

“Not all the tests passed, but getting this out is really important”

“No tests passing; no way I’m deploying”

Think About the System

“The best way for me to monitor my new app is this new metrics tool nobody has used”

“I only speak to my friend New Relic, and he said your app just slowed down by 5x…”

Think About the System

“I feel really good about this”

“Me too…”

“Bad news, kid… I ran out of worker nodes…”

The Result

▶ Environmental Consistency

▶ Process That Is Appealing

▶ Early Detection and Response

▶ Instant Intervention and Rollback

Back to Our Pain Points

▶ Manually Set Static Configuration

▶ Manual Monitoring

▶ Process Level Health Checks

▶ Manual CD Pipeline (even with automated tests)

▶ All or Nothing Deployments

Selfish

Does Scale

What do we see now?

[Chart: Yodle Service Count, monthly, Nov-14 through May-16; y-axis 0–250]

What do we see now?

[Chart: Monthly Deployments, Nov-14 through May-16; y-axis 0–1,400]

Entitled Capabilities

▶ Sustainable

▶ Independent

▶ Scalable

▶ Clear Intention

Extending This Idea

▶ Autonomic Systems informed our direction

▶ Automatic decisions are made based on basic health stats and New Relic data

▶ Imagine additional sensors
  ▶ Database Load
  ▶ Service Mesh Health
  ▶ Custom Metrics
  ▶ Browser Product Data

▶ Give your CI/CD processes insights (sketch below)
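A sketch of what those pluggable sensors could look like: database load, mesh health, custom metrics, or browser data vetoing a deploy the same way New Relic does today. The structure is entirely hypothetical.

```python
from typing import Callable, List

Sensor = Callable[[str], bool]  # app name -> "does this look healthy?"

def database_load_ok(app: str) -> bool: return True    # stub sensor
def service_mesh_ok(app: str) -> bool: return True     # stub sensor
def browser_timings_ok(app: str) -> bool: return True  # stub sensor

SENSORS: List[Sensor] = [database_load_ok, service_mesh_ok, browser_timings_ok]

def release_gate(app: str, sensors: List[Sensor] = SENSORS) -> bool:
    # The CI/CD pipeline proceeds only if every sensor agrees; adding a new
    # insight is just appending a function, with no pipeline changes.
    return all(sensor(app) for sensor in sensors)

print(release_gate("orders-service"))
```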

I am Selfish
I CARE ABOUT YOU
BY CARING ABOUT MYSELF