The Selfish Stack
FutureStack, August 2016
Cristopher Stauffer
Director of Test Engineering
cstauffer@yodle.com
A Common Tale
Proof of Concept
Beta Release
Critical Production Application
Critical Business Liability
Founders
Early Startup
Well Funded
Expanding
▶ Unsustainable Development Costs
▶ Unnecessary Development Synchronization
▶ Inability to Scale
▶ Is All Things To All People
Production Liability
Engineering Ecosystem
Entitled Capabilities
▶ Sustainable
▶ Independent
▶ Scalable
▶ Clear Intention
Our System
▶ Microservices Architecture (2 Years Ago)
  ▶ HTTP Microservices
  ▶ Docker Containers
  ▶ Unit, Function Testing
  ▶ On Demand Deployment
  ▶ Infrastructure Monitoring
Our System
▶ Microservices Architecture (18 Months Ago)
  ▶ Promoting Only Microservice development
  ▶ Promoting breaking apart monoliths
  ▶ Deploying 100 releases a month
What were we seeing?
[Chart: Service Count, Nov-14 through May-16; y-axis 0 to 250]
What were we seeing?
[Chart: Environment Deployments, Nov-14 through May-16; y-axis 0 to 1400]
Our System
▶ Microservices Architecture (18 Months Ago)
  ▶ Out of Memory
  ▶ Out of Disk Space
  ▶ Diagnosing Performance Degradations
  ▶ Misconfigurations
  ▶ Missed certification steps
  ▶ Bad merges
Our System
▶ Customer/User Dissatisfaction
▶ Loss in Engineering Confidence
Pain Points
▶ Manually Set Static Configuration
▶ Manual Monitoring
▶ Process Level Health Checks
▶ Manual CD Pipeline (even with automated tests)
▶ All or Nothing Deployments
Why Was It So Painful?
Won’t Scale
Think About a Selfless Stack
“My new service is going to use up all the memory on the host, but it needs it”
“If you say so!”
Think About a Selfless Stack
“Not all the tests passed but getting this out is really important”
“You know best!”
Think About a Selfless Stack
“The best way for me to monitor my new app is this new metrics tool nobody has used”
“I’m sure there’s a good reason…”
Think About a Selfless Stack
....xyzabc…..
“A-OK!”
Think About a Selfless Stack
....xyzabc…..
“I’M NOT OK!”
What did we need?
▶ Correct Configuration and Routing
▶ Error Detection and Resolution
▶ Utilization and Optimization of Resources
▶ Protecting System Integrity
Selfish Stack
I am Selfish
I CARE ABOUT YOU
JUST NOT AS MUCH AS I CARE ABOUT MYSELF
Error Detection and Alerting
▶ New Relic Monitoring For Microservices and Legacy Apps
▶ Simple – just add an agent
▶ Detailed per application dashboards out of the box
▶ Single score to focus attention
Base Docker Image
▶ Docker Engine
▶ Ubuntu Image
▶ JVM Image (e.g. Java 8)
▶ New Relic Agent
▶ Microservice Image
100 Apps in 100 Days
▶ Made use of our base containers
▶ Rolled out monitoring to every application in the fleet
▶ Suddenly we had visibility everywhere.
▶ Alerting was based on a team ownership model
Is this a Selfish System?
▶ Pool of Docker containers glued together
▶ Engineers are alerted
▶ Engineers make changes
▶ Engineers make the call
Not Selfish
Configuration and Application Orchestration
▶ Configuration
▶ Provisioning
▶ Routing
▶ Resource Balancing
What Engineers Should Be Focusing On
▶ Delivering customer value
▶ Satisfying internal needs
▶ Improving system resiliency
▶ Increasing Engineering productivity
Pain Points
Back to Our Pain Points
▶ Manually Set Static Configuration
▶ Manual Monitoring
▶ Process Level Health Checks
▶ Manual CD Pipeline (even with automated tests)
▶ All or Nothing Deployments
Platform as a Service
▶ Utilizes Mesos/Marathon Technology
▶ Highly Available
▶ Container Orchestration Platform
▶ Capable of intelligently limiting resources and balancing load
▶ Discovering and Routing Aware
▶ Capable of detecting ‘unhealthy’ applications
Basic Workflow
▶ Deploy applications to PaaS
▶ PaaS decides what host and port to run applications on
▶ PaaS determines if resources are available
▶ Health checks are built in to ensure application uptime
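To make the workflow above concrete, here is a minimal sketch of what an application might hand to a Mesos/Marathon-based PaaS. The talk does not show Yodle's actual definitions, so the app id, image name, `/health` path, port 8080, and all numeric values are hypothetical. The field names follow Marathon's app definition format for Docker containers with bridge networking as commonly documented at the time; check them against the Marathon version in use.

```python
import json


def build_app_definition(app_id, image, instances=2, cpus=0.5, mem_mb=512):
    """Build a Marathon-style app definition as a plain dict (illustrative values)."""
    return {
        "id": app_id,
        "instances": instances,  # desired instance count
        "cpus": cpus,            # CPU share requested per instance
        "mem": mem_mb,           # memory requested per instance, in MB
        "container": {
            "type": "DOCKER",
            "docker": {
                "image": image,
                "network": "BRIDGE",
                # hostPort 0 leaves the choice of host port to the platform.
                "portMappings": [{"containerPort": 8080, "hostPort": 0}],
            },
        },
        # Application-level HTTP check rather than a process-level one.
        "healthChecks": [
            {
                "protocol": "HTTP",
                "path": "/health",  # hypothetical endpoint
                "portIndex": 0,
                "gracePeriodSeconds": 60,
                "intervalSeconds": 10,
                "timeoutSeconds": 5,
                "maxConsecutiveFailures": 3,
            }
        ],
    }


# Hypothetical service name and registry, for illustration only.
app = build_app_definition("/example/hello-service",
                           "registry.example.com/hello-service:1.0")
print(json.dumps(app, indent=2))
```

The application states what it needs (instances, CPU, memory, a health endpoint) and leaves host and port selection to the platform, which matches the division of responsibility on this slide.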
Health Checks
▶ Complements New Relic as startup validation
▶ Addresses risks of nodes hard crashing and not recovering
▶ Addresses risk of non-reporting New Relic hosts due to OOM
▶ Attempts to always maintain recommended instance count
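The idea of always maintaining the recommended instance count can be sketched as a small reconciliation step. This is not how Marathon is implemented; it is a simplified illustration in which `Instance`, `reconcile`, and the failure threshold are hypothetical names and values. An instance that does not report at all (for example after an OOM) is treated the same as one that fails its check.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    consecutive_failures: int = 0


def reconcile(instances, health, target_count, max_failures=3):
    """Apply one round of health results.

    Returns (kept, killed, to_start): instances that stay, instances that
    exceeded the failure threshold, and how many new instances are needed
    to get back to target_count.
    """
    kept, killed = [], []
    for inst in instances:
        # A missing entry means the instance did not report: count it as failed.
        if health.get(inst.name, False):
            inst.consecutive_failures = 0
        else:
            inst.consecutive_failures += 1
        if inst.consecutive_failures >= max_failures:
            killed.append(inst)
        else:
            kept.append(inst)
    to_start = max(0, target_count - len(kept))
    return kept, killed, to_start


instances = [Instance("a"), Instance("b", consecutive_failures=2), Instance("c")]
health = {"a": True, "b": False}  # "c" did not report this round
kept, killed, to_start = reconcile(instances, health, target_count=3)
print([i.name for i in kept], [i.name for i in killed], to_start)
```

Here `b` crosses the threshold and is replaced, while `c` gets another chance because it has only one recorded failure.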
Resource Utilization
▶ Re-balancing of applications across fleet of nodes
▶ Safeguards for CPU and memory starvation
▶ Ability to scale on demand (still human driven)
▶ Protective of over allocation
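A minimal sketch of what "protective of over allocation" can mean in practice: the platform only places an application on a node that can actually fit the request, and otherwise refuses. The `Node` type, the `place` function, and the most-free-memory policy are assumptions made for illustration; real Mesos resource offers are more involved.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpus: float
    free_mem_mb: int


def place(nodes, cpus, mem_mb):
    """Reserve resources on the fitting node with the most free memory.

    Returns the chosen node name, or None when no node can fit the request.
    """
    candidates = [n for n in nodes
                  if n.free_cpus >= cpus and n.free_mem_mb >= mem_mb]
    if not candidates:
        return None  # refuse rather than starve applications already running
    chosen = max(candidates, key=lambda n: n.free_mem_mb)
    chosen.free_cpus -= cpus
    chosen.free_mem_mb -= mem_mb
    return chosen.name


nodes = [Node("node-1", 2.0, 1024), Node("node-2", 4.0, 4096)]
first = place(nodes, cpus=1.0, mem_mb=2048)   # fits on node-2
second = place(nodes, cpus=1.0, mem_mb=4096)  # no node has this much left
print(first, second)
```

The second request is rejected instead of being squeezed onto a host, which is the "selfish" behavior the talk argues for.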
Pain Points
Back to Our Pain Points
▶ Manually Set Static Configuration
▶ Manual Monitoring
▶ Process Level Health Checks
▶ Manual CD Pipeline (even with automated tests)
▶ All or Nothing Deployments
Little Selfish
Won’t Scale
What did we need?
▶ Correct Configuration and Routing
▶ Error Detection and Resolution
▶ Utilization and Optimization of Resources
▶ Protecting System Integrity
Safe Continuous Delivery
Regressions give comfort
▶ Monolithic releases are understandable
▶ We tested everything
▶ Everything works
Continuous Integration
Release code as it is written
Continuous Delivery Pipeline
Develop
Commit to Branch
Continuous Integration
Merge
Continuous Delivery
Regressions Are Resource Intensive
▶ Empower continuous delivery
▶ Focused – Highly Selective to Integration Testing
Continuous Integration
Enter the Canary
▶ Landscape is in flux
▶ If we test a subset of things how can we be sure everything works?
▶ Canary Ensures:
  ▶ Dependencies are met
  ▶ Existing contracts are satisfied
  ▶ Production load is handled
Continuous Delivery Pipeline
Canary Pipeline
▶ Special canary routing in our service discovery layer
▶ Test anywhere in the service mesh
▶ Discoverable tests using a /tests endpoint
▶ Monitor canary health in New Relic
Canary Isolated
▶ Receives no production traffic
▶ Reports to New Relic using unique name
▶ Discoverable and routable by Canary Tests
▶ Monitored for a configurable amount of time
▶ Triggers rollback if Canary Tests fail or New Relic reports Yellow/Red
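The rollback rule for the isolated canary can be sketched as a single decision function. The names `Health` and `isolated_canary_verdict` are hypothetical, and treating "no tests discovered" as a failure is an assumption the slides do not state.

```python
from enum import Enum


class Health(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


def isolated_canary_verdict(test_results, health_samples):
    """Return "promote" or "rollback" for an isolated canary.

    test_results maps canary test names to pass/fail booleans;
    health_samples holds the health colors observed during the
    monitoring window.
    """
    if not test_results:
        return "rollback"  # assumption: an untested canary is not trusted
    if not all(test_results.values()):
        return "rollback"  # any failing canary test triggers rollback
    if any(sample is not Health.GREEN for sample in health_samples):
        return "rollback"  # Yellow or Red at any point triggers rollback
    return "promote"


ok = isolated_canary_verdict(
    {"contract": True, "dependencies": True},
    [Health.GREEN, Health.GREEN],
)
degraded = isolated_canary_verdict(
    {"contract": True, "dependencies": True},
    [Health.GREEN, Health.YELLOW],
)
print(ok, degraded)
```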
Canary Partial
▶ Receives % of production traffic
▶ Reports to New Relic using unique name
▶ Monitored for a configurable amount of time
▶ Ensures similar response times and return codes
▶ Ensures similar CPU / memory utilization
▶ Triggers rollback if New Relic reports Yellow/Red
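One way to read "similar" response times, return codes, and CPU / memory utilization is as a tolerance check of the canary against the instances it would replace. The sketch below is an assumption about how such a comparison could look, not the talk's implementation: the `Metrics` fields, the 20% tolerance, and the use of an error rate to stand in for return codes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    p95_response_ms: float
    error_rate: float   # fraction of non-success return codes, 0.0 to 1.0
    cpu_percent: float
    mem_mb: float


def within(canary_value, baseline_value, tolerance):
    """True when the canary does not exceed the baseline by more than tolerance."""
    return canary_value <= baseline_value * (1 + tolerance)


def canary_is_similar(canary, baseline, tolerance=0.2, max_error_delta=0.01):
    """Compare a partial canary with the baseline fleet on each signal."""
    return (
        within(canary.p95_response_ms, baseline.p95_response_ms, tolerance)
        and canary.error_rate - baseline.error_rate <= max_error_delta
        and within(canary.cpu_percent, baseline.cpu_percent, tolerance)
        and within(canary.mem_mb, baseline.mem_mb, tolerance)
    )


baseline = Metrics(p95_response_ms=200, error_rate=0.01, cpu_percent=40, mem_mb=512)
healthy = Metrics(p95_response_ms=210, error_rate=0.012, cpu_percent=42, mem_mb=530)
slow = Metrics(p95_response_ms=1000, error_rate=0.01, cpu_percent=40, mem_mb=512)
print(canary_is_similar(healthy, baseline), canary_is_similar(slow, baseline))
```

A canary that is five times slower than the baseline fails the check even though its error rate and resource usage look normal.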
Continuous Delivery Pipeline
▶ Receives general production traffic
▶ Reports to New Relic under unique name
▶ Monitored for a configurable amount of time
▶ Ensures similar response times and return codes
▶ Ensures similar CPU / Memory utilization
▶ Triggers rollback if New Relic reports Yellow/Red
The Actors
▶ No Human involvement
▶ Bamboo Build Agent
▶ Launch Pad (Custom Microservice) orchestrates Canary Process
▶ Cerebro (Custom Microservice) retrieves sensory information:
  ▶ Service Discovery
  ▶ Health Check
  ▶ New Relic
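The split between an orchestrator and a sensor service can be sketched as a loop that advances through the canary stages and consults the sensors at each one. This is an illustration only: `run_canary`, the callables it receives, and the stage name "full" for the final stage are hypothetical, and a real pipeline would wait between checks for the configured monitoring period.

```python
# "isolated" and "partial" follow the slides; "full" is an assumed name
# for the final stage that receives general production traffic.
STAGES = ("isolated", "partial", "full")


def run_canary(deploy_stage, read_signals, rollback, checks_per_stage=3):
    """Advance a release through the canary stages with no human involved.

    deploy_stage(stage) puts the canary into the given stage,
    read_signals(stage) returns a dict of sensor name -> bool (healthy),
    rollback(stage) undoes the release.
    """
    for stage in STAGES:
        deploy_stage(stage)
        for _ in range(checks_per_stage):
            signals = read_signals(stage)
            if not all(signals.values()):
                rollback(stage)
                return f"rolled back at {stage}"
    return "released"


log = []


def fake_signals(stage):
    # Simulated sensors: monitoring turns unhealthy once real traffic arrives.
    return {"discovery": True, "health_check": True, "new_relic": stage != "partial"}


result = run_canary(
    deploy_stage=lambda stage: log.append(("deploy", stage)),
    read_signals=fake_signals,
    rollback=lambda stage: log.append(("rollback", stage)),
)
print(result, log)
```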
Is This Selfish?
Think About the System
“My new service is going to use up all the memory on the host, but it needs it”
“Yeah… I don’t have that to spare.”
Think About the System
“Not all the tests passed but getting this out is really important”
“No tests passing, no way I’m deploying”
Think About the System
“The best way for me to monitor my new app is this new metrics tool nobody has used”
“So I only speak to my friend New Relic and he said your app just slowed down by 5x…”
Think About the System
“I feel really good about this”
“Me too…”
“Bad news, kid… I ran out of worker nodes…”
The Result
▶ Environmental Consistency
▶ Process That Is Appealing
▶ Early Detection and Response
▶ Instant Intervention and Rollback
Pain Points
Back to Our Pain Points
▶ Manually Set Static Configuration
▶ Manual Monitoring
▶ Process Level Health Checks
▶ Manual CD Pipeline (even with automated tests)
▶ All or Nothing Deployments
Selfish
Does Scale
What do we see now?
[Chart: Yodle Service Count, Nov-14 through May-16; y-axis 0 to 250]
What do we see now?
[Chart: Monthly Deployments, Nov-14 through May-16; y-axis 0 to 1400]
Entitled Capabilities
▶ Sustainable
▶ Independent
▶ Scalable
▶ Clear Intention
Extending This Idea
▶ Autonomic Systems informed our direction
▶ Automatic Decisions are made based on basic health stats and New Relic Data
▶ Imagine additional sensors:
  ▶ Database Load
  ▶ Service Mesh Health
  ▶ Custom Metrics
  ▶ Browser Product Data
▶ Give your CI/CD Processes insights
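Adding sensors becomes straightforward if each one is just a function that reports a color and the pipeline acts on the worst reading. The sketch below illustrates that idea under stated assumptions: `overall_status`, the sensor names, and the rule that a sensor which fails to report counts as red are all hypothetical.

```python
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def overall_status(sensors):
    """Poll every sensor and return (worst_status, readings).

    sensors maps a sensor name to a zero-argument callable that returns
    "green", "yellow", or "red".
    """
    readings = {}
    for name, sensor in sensors.items():
        try:
            status = sensor()
        except Exception:
            status = "red"  # assumption: a sensor that cannot report is bad news
        if status not in SEVERITY:
            status = "red"  # assumption: unknown values are treated as red
        readings[name] = status
    worst = max(readings.values(), key=SEVERITY.__getitem__, default="green")
    return worst, readings


# Hypothetical sensors standing in for the ideas listed on the slide.
sensors = {
    "new_relic": lambda: "green",
    "database_load": lambda: "yellow",
    "service_mesh": lambda: "green",
}
worst, readings = overall_status(sensors)
print(worst, readings)
```

A new signal such as browser product data would be one more entry in `sensors`, with no change to the decision logic.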
I am Selfish
I CARE ABOUT YOU
BY CARING ABOUT MYSELF