Gamifying Operational Excellence
The Service Score Card
Agenda
1 The Problem
2 A Solution idea
3 A Solution tour
4 The results
5 Takeaways, lessons learnt & questions
About me: Danny ☃ Lawrence
“If it's not broken, I’ll fix it.”
From Australia, on loan as
Staff SRE @ LinkedIn
jobs, companies, recruiter
& Finder of encoding bugs
Good news SRECON.
You passed the ☃ test.
Some terms (before we really get started)
Operational Excellence: effective and efficient delivery of information, technology, and services required by end users that add measurable value.
Operational Excellence: doing everything required to make sure all of your services are as fast and as reliable as possible.
Gamification: application of game-design elements and game principles in non-game contexts.
Some background (LinkedIn SRE crash course)
Mostly Java
Multitudes of services
Doing lots of things
Service-oriented architecture
Everything talks to everything
My direct team looks after 80+ services
We have 200+ SREs
The Problem (What started this whole thing)
Problem 1: The GOOD & The BAD
BAD services wake me up
GOOD services let me sleep
What makes a GOOD service at LinkedIn is a moving target.
Technologies and dependencies change over time.
Some examples:
Upgrading dependencies & libraries
Java / Jetty / Play / Tomcat
Correct usage of TLS
Switching databases / caches
Migrate from SVN to Git
Reduce application startup time
Set up error budgeting
True up the number of metrics
A GOOD service can turn into a BAD service if you are not checking it.
Unfortunately, BAD services do not magically turn into GOOD services.
Problem 2: Knowing what is BAD
Problem 3: Knowing why it’s BAD
Problem 4: Tribal knowledge about how to get to GOOD
The only thing we appear to hate more than not having documentation... is writing documentation.
The Problem: summary
BAD services wake me up
Time will cause GOOD to turn BAD
Hard to know what is BAD
Hard to know why it is BAD
Not sure how to fix the BAD
The Service ScoreCard (A solution)
In order to determine the health of the services we support, we define a list of production requirements.
Apply a weight to each requirement.
Codify each requirement into a check.
Execute these checks for each service.
Tally up the results for each service.
Grade the service from “F” to “A+”.
Add all the services into a high-score system.
Then publish those scores to the company.
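The weight, tally, grade and high-score steps above can be sketched in a few lines. This is a minimal illustration, not the scorecard's actual scoring code: the grade boundaries in GRADE_BANDS, the function names, and the "legacy-service" entry in the usage note are all assumptions, since the talk only says that grades run from “F” to “A+”.

```python
from typing import Dict, List, Tuple

# Hypothetical grade boundaries; the slides do not give the real ones.
GRADE_BANDS: List[Tuple[float, str]] = [
    (0.97, "A+"),
    (0.90, "A"),
    (0.80, "B"),
    (0.70, "C"),
    (0.60, "D"),
]

# One check outcome: (weight, state), with state between 0.0 (fail) and 1.0 (pass).
CheckResult = Tuple[int, float]


def weighted_score(results: List[CheckResult]) -> float:
    """Return the weighted fraction of passed checks, between 0.0 and 1.0."""
    total = sum(weight for weight, _ in results)
    if total == 0:
        return 0.0
    return sum(weight * state for weight, state in results) / total


def grade(score: float) -> str:
    """Map a 0.0-1.0 score to a letter grade using the assumed bands."""
    for floor, letter in GRADE_BANDS:
        if score >= floor:
            return letter
    return "F"


def high_scores(services: Dict[str, List[CheckResult]]) -> List[Tuple[str, float, str]]:
    """Rank services from best to worst as (name, score, grade) rows."""
    table = [(name, weighted_score(results)) for name, results in services.items()]
    table.sort(key=lambda row: row[1], reverse=True)
    return [(name, score, grade(score)) for name, score in table]
```

Calling high_scores({"jobs-server": [(5, 1.0), (3, 1.0), (2, 0.0)], "legacy-service": [(5, 0.0), (3, 1.0), (2, 0.0)]}) ranks jobs-server first, because it passes 8 of its 10 weighted points.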
This is great, but how do I improve the score?
How can I add X check into the system?
What makes a check?
Checks are one type of plugin.
Fetch plugins gather data.
Check plugins check the data.
We use the fetch plugins to gather remote data from: SVN, Git, configuration DBs, host databases, monitoring systems, build systems, deployment systems.
Basically, if we can fetch it, then we do so.
We build a giant context object.
The check plugin will look at our context object.
All plugins are small Python scripts, where small is 10~30 LOC.
Simply return 2 or 3 things:
state*: True, False, None or 0.0 - 1.0
message*: short string
data: Python dict of interesting things.
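A runner has to cope with both the 2-item and the 3-item return shape, and with the four kinds of state. A minimal sketch of that normalisation follows. The Result type and the normalize function are hypothetical names, and reading None as "no verdict" is an assumption, because the slides do not say what None means.

```python
from typing import Any, Dict, NamedTuple, Optional


class Result(NamedTuple):
    score: Optional[float]  # None is kept as "no verdict" (an assumed meaning)
    message: str
    data: Dict[str, Any]


def normalize(raw: tuple) -> Result:
    """Turn a plugin's (state, message[, data]) tuple into a uniform Result."""
    if len(raw) not in (2, 3):
        raise ValueError("plugin must return (state, message[, data])")
    state, message = raw[0], raw[1]
    data = raw[2] if len(raw) == 3 else {}
    if state is None:
        score = None
    elif isinstance(state, bool):
        # bool is tested before the numeric case because True/False are also ints
        score = 1.0 if state else 0.0
    else:
        score = float(state)
        if not 0.0 <= score <= 1.0:
            raise ValueError("numeric state must be within 0.0 - 1.0")
    return Result(score, message, data)
```

With this in place, a partial pass such as (0.5, "half the hosts upgraded") and a plain (True, "ok") can be tallied the same way.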
Example fetch plugin
    import requests as r  # assumption: "r" on the slide is the requests library
    import ssc            # assumption: module name of the scorecard plugin API

    @ssc.tags("ownership")
    def fetch_ownership(service_name):
        "Fetch all the ownership data of a service"
        o = r.get("http://owners/" + service_name)
        return True, "gathered owner data", o.json()
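How the fetched data ends up in the "giant context object" is not shown in the slides. One plausible sketch is below: each fetch plugin's data is filed under its tag and wrapped so that attribute access such as ctx.ownership.eng_team works. The names build_context, to_namespace and fake_fetch_ownership are hypothetical, and the fake plugin stands in for a real one so the example needs no network.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, Tuple

# Assumed shape of a fetch plugin: name in, (state, message, data) out.
FetchPlugin = Callable[[str], Tuple[Any, str, Dict[str, Any]]]


def to_namespace(value: Any) -> Any:
    """Recursively wrap dicts so nested keys can be read as attributes."""
    if isinstance(value, dict):
        return SimpleNamespace(**{k: to_namespace(v) for k, v in value.items()})
    if isinstance(value, list):
        return [to_namespace(v) for v in value]
    return value


def build_context(service_name: str, fetchers: Dict[str, FetchPlugin]) -> SimpleNamespace:
    """Run every fetch plugin and file its data under the plugin's tag."""
    ctx: Dict[str, Any] = {"service_name": service_name}
    for tag, fetch in fetchers.items():
        state, _message, data = fetch(service_name)
        # A failed fetch leaves an empty section rather than aborting the run.
        ctx[tag] = data if state else {}
    return to_namespace(ctx)


def fake_fetch_ownership(service_name: str):
    """Stand-in for a real fetch plugin; returns canned data."""
    return True, "gathered owner data", {"eng_team": "jobs-team"}
```

For example, build_context("jobs-server", {"ownership": fake_fetch_ownership}).ownership.eng_team evaluates to "jobs-team".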
Example check plugin
    import ssc  # assumption: module name of the scorecard plugin API

    @ssc.weight(5)
    @ssc.tags('ownership')
    @ssc.wiki('http://wiki/ssc_eng_owner')
    def check_eng_team(ctx):
        "ensure ENG ownership of a service"
        if ctx.ownership.eng_team:
            return True, ctx.ownership.eng_team
        return False, "missing eng_team"
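The slides do not show how the ssc decorators are implemented. A common way to build decorators like these is to attach metadata to the function object and let a runner read it back. The sketch below does that with stand-in weight, tags and wiki decorators and a hypothetical run_checks runner; none of it is the real ssc API.

```python
from types import SimpleNamespace
from typing import Callable, List


def weight(value: int) -> Callable:
    """Stand-in for @ssc.weight: remember how much the check counts."""
    def decorate(func):
        func.weight = value
        return func
    return decorate


def tags(*names: str) -> Callable:
    """Stand-in for @ssc.tags: remember which context sections the check uses."""
    def decorate(func):
        func.tags = set(names)
        return func
    return decorate


def wiki(url: str) -> Callable:
    """Stand-in for @ssc.wiki: remember where the fix is documented."""
    def decorate(func):
        func.wiki = url
        return func
    return decorate


@weight(5)
@tags("ownership")
@wiki("http://wiki/ssc_eng_owner")
def check_eng_team(ctx):
    "ensure ENG ownership of a service"
    if ctx.ownership.eng_team:
        return True, ctx.ownership.eng_team
    return False, "missing eng_team"


def run_checks(ctx, checks: List[Callable]) -> List[dict]:
    """Run each check and keep its metadata next to the outcome."""
    report = []
    for check in checks:
        state, message = check(ctx)[:2]
        report.append({
            "check": check.__name__,
            "weight": getattr(check, "weight", 1),
            "passed": bool(state),
            "message": message,
            "wiki": getattr(check, "wiki", None),
        })
    return report
```

Keeping the wiki URL in each report row means a failed check can link straight to the page that explains how to fix it.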
Putting it all together
Problems
Understanding what is BAD
Knowing why it is BAD
Not sure how to fix the BAD
What is the check?
Why is it important?
How long will it take to fix?
How will it be fixed?
AngularJS (image: CC BY 4.0, https://angular.io/presskit.html, 2017)
{{service_name}} becomes jobs-server
{{context.ownership.eng_owner}} becomes jobs-team
Using our fetched data in the wiki
    var query = $location.search();
    var service_name = query['service_name'];
    var url = 'http://ssc/api/' + service_name;
    // Fetch the scorecard context for this service and expose it to the wiki template
    $http.get(url).then(
      function(response) {
        $scope.ctx = response.data;
      }
    );
{{ctx.ownership.owner_eng}}
{{ctx.number_of_hosts}}
{{ctx.product.lib.jetty.version}}
{{ctx.hosts.hostnames}}
{{ctx.is_deployed_in_prod}}
{{ctx.commits.last_commit}}
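To make the substitution concrete, here is a small Python sketch of how a {{dotted.path}} placeholder can be resolved against fetched data. It only illustrates the idea and is not how AngularJS implements interpolation. The render and lookup names are hypothetical, as are the sample values apart from jobs-server and jobs-team.

```python
import re
from typing import Any, Dict

# Matches {{ name }} or {{ dotted.path }} placeholders.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def lookup(scope: Dict[str, Any], dotted: str) -> Any:
    """Walk a dotted path such as 'ctx.ownership.owner_eng' through nested dicts."""
    value: Any = scope
    for part in dotted.split("."):
        value = value[part]
    return value


def render(template: str, scope: Dict[str, Any]) -> str:
    """Replace every placeholder with its value; unknown paths become empty text."""
    def substitute(match) -> str:
        try:
            return str(lookup(scope, match.group(1)))
        except (KeyError, TypeError):
            return ""
    return PLACEHOLDER.sub(substitute, template)
```

Given a scope of {"service_name": "jobs-server", "ctx": {"ownership": {"owner_eng": "jobs-team"}}}, the template "{{service_name}} is owned by {{ctx.ownership.owner_eng}}" renders as "jobs-server is owned by jobs-team", so one wiki page can serve every service.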
Now
Reports show what is BAD
Checks validate why it is BAD
Wiki shows how to fix the BAD
No more of these emails:
“If you use a lib-core, then upgrade it, we found a bug”
How many of my 80 services use this lib?
How do I check?
How do I upgrade?
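With the context data already fetched for every service, the "how many of my services use this lib" question can become a check instead of an email. The sketch below assumes each context exposes library versions under a "libs" key and that versions are purely numeric. That layout, the function names and the 2.4.0 minimum are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse a purely numeric dotted version such as '2.10.3'."""
    return tuple(int(part) for part in text.split("."))


def check_lib_core(ctx: dict, minimum: str = "2.4.0"):
    """Check-plugin-shaped function: fail only when lib-core is present and too old."""
    version: Optional[str] = ctx.get("libs", {}).get("lib-core")
    if version is None:
        return None, "lib-core not used"
    if parse_version(version) >= parse_version(minimum):
        return True, "lib-core " + version
    return False, "lib-core " + version + " is older than " + minimum


def services_to_upgrade(contexts: Dict[str, dict], minimum: str = "2.4.0") -> List[str]:
    """List, in alphabetical order, the services that fail the lib-core check."""
    return sorted(
        name for name, ctx in contexts.items()
        if check_lib_core(ctx, minimum)[0] is False
    )
```

Comparing versions as integer tuples avoids the string-comparison trap where "2.10.0" sorts before "2.4.0".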
Where does this tool fit?
pre-commit, Build, Deployment, Monitoring
Service Scorecard
API
hack-days, Reporting, Deployment, Monitoring
Results & Outcomes
What do we do with the scores?
Priority #1: Getting the grades better
                                When we started    Now
Average grade for my team       40%                80%
Average score across SRE        35%                60%
Checks in 24 hours              15,560             89,859
Number of checks per service    15                 31
We can now explore new ways to use the scores
Carrot & Stick
Carrot / GOOD
Stick / BAD
No SRE support for F Grade services.
F Grade services generally cause the most problems.
No deploy moratorium for A+ services
A+ services generally cause the least problems.
A services are allowed to deploy 24/7
Premium SRE support for A+ services
Priority build queues for GOOD services.
Tiger teams to raise the scores on F Grade services
Hack Days
FREE BEER
Basically any problem can be solved with FREE BEER
OR T-Shirts
Influence where we allocate open headcount
Simple way to get things done
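As one illustration of acting on the grades, the deploy-related carrots above could be encoded as a small gate that a deployment system calls after reading a grade from the scorecard API. Only two rules come from the slides: A+ services are exempt from deploy moratoriums, and A services may deploy 24/7. The business-hours window for other grades and all function names are assumptions made for the sketch.

```python
from datetime import datetime

# From the slides: A+ is exempt from moratoriums, A-grade services deploy 24/7.
MORATORIUM_EXEMPT = {"A+"}
ANYTIME_DEPLOY = {"A+", "A"}


def in_business_hours(now: datetime) -> bool:
    """Hypothetical deploy window: Monday to Friday, 09:00 to 16:59."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def may_deploy(grade: str, now: datetime, moratorium: bool = False) -> bool:
    """Decide whether a service with this grade may deploy right now."""
    if moratorium:
        return grade in MORATORIUM_EXEMPT
    if grade in ANYTIME_DEPLOY:
        return True
    return in_business_hours(now)
```

A gate like this keeps the incentive visible: raising a service's grade directly widens when it can ship.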
Takeaways & Lessons Learnt
Everyone cares about Reliability.
Everyone cares about Reliability, everyone is a Site Reliability Engineer.
Everyone cares about Reliability, you just need to empower them.
Hack Days are important, this POC was built in an afternoon.
Getting the data was easy, finding interesting ways to use it is hard.
Make it as easy as possible to do the right thing.
Cheers!
Q & A