A Canary Analysis presentation for QCon/NY 2014
Canary Analyze All the Things
Roy Rapoport, @royrapoport
June 12, 2014
Significant contributions by Chris Sanden, @chris_sanden1
Oh, the Places We’ll Go!
• Introductions
• Proposed Use Case and Definition
• Continuous Improvement / MVP Model
• Issues, Solutions
• Cloud Considerations
• The Road at Netflix
A Word About Me …
• About 20 years in technology
  • Systems engineering, networking, software development, QA, release management
• Time at Netflix: 1809 days (4y:11m:14d)
• At Netflix:
  • Systems Engineering, Service Delivery in IT/Ops
  • Troubleshooter and Builder of Python Things[tm] in Product Engineering
• Current role: Insight Engineering in Product Engineering
  • Real-Time Operational Insight
A Word About Netflix… Just the Stats
• 16 years
• 2000+ employees
• 48 million users
• 5x10^9 hours/quarter
A Word About Netflix… Freedom and Responsibility Culture
• Optimize speed of innovation
  • Constrain availability
  • Cost will be what cost will be
• Hire smart (experienced) people
  • Get out of their way
• Anti-process bias
A Word About Netflix… Technology and Operations
• Service Oriented Architecture
• Decentralized Operations. You:
  • Build
  • Test
  • Deploy
  • Set up alerting and monitoring
  • Wake up at 2AM
Why Canary Analysis?
So You’ve Just Done a Release
> curl http://WhatDoesTheFooSay.prod.netflix.net/api/v1/cat
{"response": "meow"}
> curl http://WhatDoesTheFooSay.prod.netflix.net/api/v1/dog
{"response": "woof"}
> curl http://WhatDoesTheFooSay.prod.netflix.net/api/v1/fox
{"response": "wa-pa-pa-pa-pa-pa-pow"}
The correct answer to “what does the fox say?” is left as an exercise for the reader
You Need Better Testing!
Well, yeah
“I’m going to push to production, though I’m pretty sure it’s going to kill the system”
- Said no one, ever*
* Hopefully
Detour: Rate of Change vs Availability
[Chart: Availability (nines, axis 0 to 6) against Rate of Change (axis ticks 1, 10, 100, 1000); the final build adds the label “Operations Engineering”]
You Need Better Testing! Deployments!
Canary Analysis!!
• A deployment process where
• a new change (in behavior, code, or both)
• is rolled out into production gradually,
• with checkpoints along the way to examine the new (canary) systems
• (optionally versus the old (baseline) systems)
• and make go/no-go decisions.
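The definition above can be read as a loop over rollout stages with a checkpoint after each one. The following is a minimal sketch under stated assumptions, not the process Netflix runs: the stage sizes are illustrative, and `deploy`, `healthy`, and `rollback` are hypothetical callbacks standing in for a real deployment system and analysis step.

```python
from typing import Callable, Sequence


def canary_rollout(
    stages: Sequence[int],
    deploy: Callable[[int], None],
    healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Roll a change out stage by stage, checking health after each step.

    Returns True if every checkpoint passed (go), False if the rollout
    was stopped and rolled back (no-go).
    """
    for size in stages:
        deploy(size)          # grow the canary fleet to `size` servers
        if not healthy():     # checkpoint: examine the canary systems
            rollback()        # no-go: undo the change
            return False
    return True               # go: every stage passed


if __name__ == "__main__":
    log = []
    ok = canary_rollout(
        stages=[1, 10, 1000],  # hypothetical stage sizes
        deploy=lambda n: log.append(f"deploy {n}"),
        healthy=lambda: True,
        rollback=lambda: log.append("rollback"),
    )
    print(ok, log)
```

Keeping the health check as a callback leaves open whether the checkpoint is a person reading dashboards or an automated comparison against the baseline.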
Canary Analysis Is Not
• A replacement for any sort of software testing
• A/B Testing
• Releasing 100% to production and hoping for the best
One Possible Process
[Diagram: Version Control System, Build & Deployment System, Automated Canary Analysis, and Customers, with a fleet of 1000 servers @ 1.0.1. Successive builds add 1 server @ 1.0.2, then 10 servers @ 1.0.2, then 1000 servers @ 1.0.2 alongside the 1.0.1 fleet; one build shows only the 1000 servers @ 1.0.2.]
Are We There Yet?
• We’re not
• You’re probably not either
Minimally …
• Observability
• Partial traffic routing
• Decision-making
Better Yet …
• Focus on the Goal
• Current Baseline Matters
• Observability segregation
26% fewer errors in canary
Hold On a Minute!
26% fewer errors in canary
Mission Accomplished
30% fewer requests handled in canary
• Absolute numbers are relatively unimportant
• Relative numbers matter
  • Error rate
  • RPS per CPU cycle
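A short worked example shows why the relative number matters here. It assumes both percentages on the slide are raw counts measured against the baseline, and the baseline counts (1000 requests, 100 errors) are hypothetical values chosen only to make the arithmetic concrete.

```python
def error_rate(errors: float, requests: float) -> float:
    """Errors per request; raises if there was no traffic to compare."""
    if requests <= 0:
        raise ValueError("no requests observed")
    return errors / requests


def relative_change(canary: float, baseline: float) -> float:
    """Fractional change of canary versus baseline (0.05 means 5% higher)."""
    return (canary - baseline) / baseline


# Hypothetical baseline counts; the canary counts apply the slide's percentages.
baseline_requests, baseline_errors = 1000, 100
canary_requests = baseline_requests * 0.70   # 30% fewer requests
canary_errors = baseline_errors * 0.74       # 26% fewer errors

change = relative_change(
    error_rate(canary_errors, canary_requests),
    error_rate(baseline_errors, baseline_requests),
)
print(f"canary error rate vs baseline: {change:+.1%}")
```

Under these assumptions the canary's error rate is 0.74 / 0.70 of the baseline's, roughly 5.7% higher, even though its absolute error count is lower.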
So You’ve Got Your Graphs
[Graph: Requests Rate Comparison (requests)]
          Type       RAM     Cores  Cost
Baseline  m3.medium  3.75GB  3      $.11/hr
Canary    m1.small   1.7GB   1      $.06/hr
Automating …
• Decision
• Execution
A Quick Recap
• Observe
• Segregate metrics
• Partial deploy
• Compare to Baseline
• Absolutes are never right
• Automate decision
• Automate execution
To Save You Some Time …
Not all metrics are created equal
Focus on System and Application Metrics
Weight by category (system, latency, etc)
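One way to picture weighting by category is a score in which each category's pass ratio counts in proportion to its weight. This is a sketch only: the category names, the weights, and the pass/fail input format are assumptions made for illustration, not the scoring used at Netflix.

```python
from typing import Dict, List, Tuple


def weighted_score(
    results: List[Tuple[str, bool]],
    weights: Dict[str, float],
) -> float:
    """Return a 0-100 score from (category, passed) metric results.

    Each category's pass ratio is scaled by its weight; categories with
    no metrics are left out so they neither help nor hurt the score.
    """
    totals: Dict[str, List[int]] = {}
    for category, passed in results:
        counts = totals.setdefault(category, [0, 0])
        counts[0] += int(passed)   # metrics that passed
        counts[1] += 1             # metrics seen
    used_weight = sum(weights[c] for c in totals)
    if used_weight == 0:
        raise ValueError("no weighted metrics to score")
    score = sum(weights[c] * p / n for c, (p, n) in totals.items())
    return 100.0 * score / used_weight


# Hypothetical categories and weights, chosen only for illustration.
weights = {"system": 1.0, "latency": 2.0, "errors": 3.0}
results = [
    ("system", True), ("system", False),
    ("latency", True),
    ("errors", True),
]
print(weighted_score(results, weights))
```

With these inputs, a failed system metric costs far less than a failed error metric would, which is the point of weighting.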
To Save You Some Time …
Outliers are out, lying
Use a group of servers
Balance fidelity with customer impact
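A group of servers makes it possible to compare a robust summary rather than a single machine's reading. The sketch below uses the median as that summary, which is one possible choice and an assumption here; the latency values are made up.

```python
from statistics import median
from typing import Sequence


def group_ratio(canary: Sequence[float], baseline: Sequence[float]) -> float:
    """Ratio of the canary group's median to the baseline group's median.

    Using the median of several servers keeps one misbehaving instance
    from deciding the outcome on its own.
    """
    if not canary or not baseline:
        raise ValueError("both groups need at least one server")
    base = median(baseline)
    if base == 0:
        raise ValueError("baseline median is zero; ratio is undefined")
    return median(canary) / base


# Hypothetical per-server latencies in ms; one canary server is an outlier.
canary_latency = [101.0, 99.0, 100.0, 480.0, 102.0]
baseline_latency = [100.0, 98.0, 101.0, 99.0, 102.0]
print(group_ratio(canary_latency, baseline_latency))
```

A larger canary group gives a more trustworthy comparison but exposes more customers to the new code, which is the fidelity versus impact trade-off the slide names.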
To Save You Some Time …
Exercise without warmup can result in injury
Repeat canary analysis frequently
Both traffic and startup time are factors
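A simple way to account for startup time is to leave out samples taken before an instance has warmed up. The sketch below assumes samples carry the seconds elapsed since instance start; `warmup_seconds` and the sample values are hypothetical, and the sketch covers only the startup factor, not the traffic factor.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (seconds since instance start, metric value)


def drop_warmup(samples: List[Sample], warmup_seconds: float) -> List[Sample]:
    """Keep only samples taken after the instance has had time to warm up."""
    return [(t, v) for t, v in samples if t >= warmup_seconds]


# Hypothetical latency samples: slow while caches fill, then steady.
samples = [(0, 900.0), (60, 400.0), (300, 110.0), (600, 105.0), (900, 102.0)]
steady = drop_warmup(samples, warmup_seconds=300)
print(steady)
```

Repeating the analysis over successive windows, as the slide suggests, would complement this filter by showing whether the canary stays healthy as traffic changes.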
To Save You Some Time …
vive la différence!
Hot-OK, Cold-OK
Let Application Owners Choose
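The sketch below reads "hot" as a canary metric noticeably above the baseline and "cold" as one noticeably below it, with the application owner choosing per metric which direction is acceptable. That reading, the 10% tolerance, and the function names are assumptions made for illustration.

```python
def classify(ratio: float, tolerance: float = 0.1) -> str:
    """Label a canary/baseline ratio as 'hot', 'cold', or 'ok'."""
    if ratio > 1 + tolerance:
        return "hot"
    if ratio < 1 - tolerance:
        return "cold"
    return "ok"


def passes(ratio: float, hot_ok: bool = False, cold_ok: bool = False,
           tolerance: float = 0.1) -> bool:
    """Apply the owner's per-metric choice of which deviations are acceptable."""
    label = classify(ratio, tolerance)
    return label == "ok" or (label == "hot" and hot_ok) or (label == "cold" and cold_ok)


# Hypothetical choices: lower latency is fine, lower request rate is not.
print(passes(0.8, cold_ok=True))   # latency 20% lower than baseline
print(passes(0.8))                 # request rate 20% lower than baseline
```

The same deviation passes for one metric and fails for another, which is why the choice is left to the people who know the application.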
To Save You Some Time …
Signal is better than no1$#[NO CARRIER]
Ignore weak signals
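One possible reading of "weak signal" is a metric with too few data points to support a comparison. The sketch below filters on sample count; the `min_samples` threshold and the metric names are hypothetical.

```python
from typing import Dict, List


def strong_signals(
    metrics: Dict[str, List[float]],
    min_samples: int = 30,
) -> Dict[str, List[float]]:
    """Drop metrics with too few data points to support a comparison."""
    return {name: values for name, values in metrics.items()
            if len(values) >= min_samples}


# Hypothetical metrics: a rarely hit endpoint produces only a few samples.
metrics = {
    "requests_per_second": [100.0] * 120,
    "rare_endpoint_latency": [250.0, 300.0],
}
print(sorted(strong_signals(metrics)))
```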
Good News
• Software-Defined Everything
• Incremental Pricing
Bad News
• Capacity Management
• Unpredictable Inconsistency
Numbers
• 752 services in production
• In-house telemetry platform
• A few metrics
Been there. Done that. Manually. Artisanally.
• Started in the Data Center
• Manual, dashboard-driven
[Dashboard graphs: CPU, Requests, Errors]
Been there. Done that. Manually.
• Context vs Precision
• No …
  • Repeatability
  • Trending
• Manual effort is manual
So Now What?
• Automate Analysis
• Took Some Effort
• Approach and analytics
• Presentation matters
Automated Canary Analysis
For Our Next Trick …
• Configuration GUI
• Deployment System Integration
• ACA All The Things
  • OpenConnect firmware updates
  • Client software changes
  • Configuration changes in production
Summary
• Canary Analysis makes your changes
  • Safer
  • Faster
  • Easier
• Most people can start doing it
• Everyone can do it better
• https://www.flickr.com/photos/cseeman
• https://www.flickr.com/photos/ransomtech
• https://www.flickr.com/photos/dougbrown47
• https://www.flickr.com/photos/andresthor/
• https://www.flickr.com/photos/dougbrown47
• https://www.flickr.com/photos/pkdesigns
Questions, Attributions, Feedback
@royrapoport