View
911
Download
3
Embed Size (px)
DESCRIPTION
At Netflix, we realize that there’s a tension between the availability of our service and our speed of innovation. If we move slowly, we can be very available -- but that’s not a good business proposition. If we move super fast, we risk downtime -- and that might annoy our customers. But what if we could increase our velocity without significantly impacting availability? How can we shift that curve so that we’re moving faster without dropping any of those coveted 9’s? How can we engineer velocity by weaving together tooling and culture with software development to expose and elevate highly effective practices? This talk describes various components of Netflix’s continuous delivery platform -- much of which is available in open source. I’ll show how these pieces fit together and allow us to build scaffolding so that we’re comfortable with software developers making the decision to push the button for prod deployment -- and helps them to recover if necessary. As a result, we can run fast, trusting our tooling and our culture. I’ll also describe how we test our resiliency through simulating failure, unleashing the monkeys (Simian Army) on our production environment. Because if you’re afraid of cute little monkeys, imagine how afraid you’ll be of a production environment that offers those same risks but doesn’t give you an opportunity to test your response to those dangers. Throughout this talk, I hope that you will challenge yourself to consider how your company can "shift the curve" through tooling and to achieve a high velocity environment without negatively impacting reliability.
Citation preview
Engineering Velocity: Continuous Delivery at Netflix
Dianne Marsh SATURN 2014
en-gi-neer-ing + ve-loc-i-ty !applying science and technology to designing and building speed
into a system
Availability vs. Rate of ChangeAv
aila
blity
(in
9’s)
0
1
2
3
4
5
6
Rate of Change0 10 100 1000
Shift the CurveAv
aila
blity
(in
9’s)
0
1
2
3
4
5
6
Rate of Change0 10 100 1000 10000
http://www.slideshare.net/reed2001/culture-1798664
Manager’s Role
Context, not Control
Loosely coupled, Tightly aligned
And hire well!
Get out of the Way
Freedom to Innovate
Support Experimentation
!
How We Built a Predictive
Autoscaling Engine
http://techblog.netflix.com/2013/11/scryer-netflixs-predictive-auto-scaling.html
Support Independent Paths of Exploration Don’t Prematurely Optimize!
Blameless Culture
Developers Deploy Their Code
Run What You Wrote
!
• Rapid Innovation
• Rapid Detection
• Rapid Response
!
= Freedom + Responsibility
Support with Tools
Jenkins Job DSL
Configuration as Code
Groovy Script
Scripts go in Version Control
http://www.slideshare.net/quidryan/configuration-as-code
Aminator
Create AMI from Base AMI
Image contains service and everything needed to run it
Unit of Deployment for Test and Prod
Abstracts Cloud Details
http://techblog.netflix.com/2013/03/ami-creation-with-aminator.html
Asgard
Deploys Netflix to the Cloud
Red/Black push
Developed to address delays in rollback
http://www.infoq.com/presentations/asgard
Red/Black Push!
• Scale up new instances
• Run canary analysis
• Turn on traffic to new ASG
• Turn off traffic to old ASG
• Wait … analyze … continue
Workflow
Continuous Delivery Engine
Judges between Stages
Represent Best Practices
http://techblog.netflix.com/2013/09/glisten-groovy-way-to-use-amazons.html
One Click Deployment?
Regional IsolationLimit Impact of Human Error
!
• Stagger Deployments?
• Canary Testing per Region?
!
Know your Service!
Multi-Region ConsistencyBuild Tooling to:
!
• Schedule Deployments
• Prefer Off-Peak
• Choose Next Available Region
• Provide Visibility by Region
Simian Army
• Chaos Monkey
• Latency Monkey
• Conformity Monkey
• Janitor Monkey (and more)
http://www.infoq.com/presentations/netflix-resiliency-failure-cloud
Chaos Monkey
Kills Running Instances
• Simulates failures inherent to running in the cloud
• In Production
Latency Monkey
Introduces Latency between services
Conformity Monkey
Have Deployments Diverged?
• Balance Regional Consistency with Regional Isolation
• Build Best Practices into Tooling and Reporting
Janitor Monkey
Reduce Cognitive Load and Cost
• Remove unused instances
• Uniform way to clean up
Shifting the Curve with Tooling
• Value Self-Service
• Test Everywhere
• Awareness of Multiple Regions
• Best Practices Represented in Tooling
• Recover Quickly and Easily
• Be Cloud Native
Shifting the Curve with Culture
• Context not Control
• Freedom to Experiment
• Blameless Culture
ArsTechnica, November 2012
“As the number of applications and the scale of the campaign's AWS infrastructure use
climbed, the DevOps team shifted to using Asgard—an open-source tool developed by
Netflix to manage cloud deployments.”