Gamifying Operational Excellence
The Service Score Card
Agenda
1 The Problem
2 A Solution idea
3 A Solution tour
4 The results
5 Takeaways, lessons learnt & questions
About me: Danny ☃ Lawrence
“If it's not broken, I’ll fix it.”
From Australia, on loan as Staff SRE @ LinkedIn (jobs, companies, recruiter)
& finder of encoding bugs
Good news SRECON.
You passed the ☃ test.
Some terms (before we really get started)
Operational Excellence: the effective and efficient delivery of information, technology, and services required by end users that add measurable value.
Operational Excellence: doing everything required to make sure all of your services are as fast and as reliable as possible.
Gamification: the application of game-design elements and game principles in non-game contexts.
Some background (LinkedIn SRE crash course)
Mostly Java
Multitudes of services
Doing lots of things
Service-oriented architecture
Everything talks to everything
My direct team looks after 80+ services
We have 200+ SREs
The Problem (What started this whole thing)
Problem 1: The GOOD & the BAD
BAD services wake me up
GOOD services let me sleep
What makes a GOOD service at LinkedIn is a moving target.
Technologies and dependencies change over time.
Some examples:
Upgrading dependencies & libraries
Java / Jetty / Play / Tomcat
Correct usage of TLS
Switching databases / caches
Migrate from SVN to GIT
Reduce application startup time
Set up error budgeting
True up the number of metrics
A GOOD service can turn into a BAD service if you are not checking it.
Unfortunately, BAD services do not magically turn into GOOD services.
Problem 2: Knowing what is BAD
Problem 3: Knowing why it’s BAD
Problem 4: Tribal knowledge about how to get to GOOD
The only thing we appear to hate more than not having documentation
... is writing documentation.
The Problem summary
BAD services wake me up
Time will cause GOOD to turn BAD
Hard to know what is BAD
Hard to know why it is BAD
Not sure how to fix the BAD
The Service ScoreCard (A solution)
In order to determine the health of the services we support,
we define a list of production requirements.
Apply a weight to each requirement
Codify each requirement into a check.
Execute these checks for each service
Tally up the results for each service.
Grade the service from “F” to “A+”
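The weight, tally, and grade steps above could be sketched roughly as follows. This is an illustration only: the function names, the example weights, and especially the grade cut-offs are assumptions, not the scorecard's real values.

```python
# Hypothetical sketch: the weights and grade cut-offs below are
# illustrative assumptions, not the real scorecard configuration.

def score_service(results):
    """results: list of (weight, state) pairs, with state in 0.0..1.0.

    Returns the weighted percentage of requirements that were met.
    """
    total = sum(weight for weight, _ in results)
    if total == 0:
        return 0.0
    earned = sum(weight * state for weight, state in results)
    return 100.0 * earned / total


# Assumed cut-offs, highest first; anything below the last one is an "F".
GRADE_BANDS = [(97, "A+"), (90, "A"), (80, "B"), (70, "C"), (60, "D")]


def grade(score):
    """Map a 0..100 score onto a letter grade from "F" to "A+"."""
    for cutoff, letter in GRADE_BANDS:
        if score >= cutoff:
            return letter
    return "F"


# One passing check (weight 5), one failing (weight 3), one half-met (weight 2).
results = [(5, 1.0), (3, 0.0), (2, 0.5)]
score = score_service(results)
print(score, grade(score))
```

Weighting means a failed high-weight requirement costs more than a failed low-weight one, which is what lets the grade reflect priorities rather than a raw count of passes.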
Add all the services into a highscore system
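A highscore table is little more than a sorted list of scores. A minimal sketch, assuming each service already has a numeric score (the service names here are made up):

```python
# Hypothetical sketch: service names and scores are invented for illustration.

def leaderboard(scores):
    """scores: dict of service name -> score.

    Returns a list of (rank, name, score), best score first.
    Ties are broken alphabetically so the order is stable between runs.
    """
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, name, score)
            for rank, (name, score) in enumerate(ranked, start=1)]


for row in leaderboard({"jobs-api": 72.5, "search": 91.0, "profile": 72.5}):
    print(row)
```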
Then publish those scores to the company.
This is great, but how do I improve the score?
How can I add X check into the system?
What makes a check?
Checks are one type of plugin.
Fetch plugins gather data; check plugins check the data.
We use the fetch plugin to gather remote data from:
SVN, GIT, Configuration DBs, host databases, monitoring systems, build systems, deployment systems.
Basically, if we can fetch it,
then we do so.
We build a giant context object.
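One way to picture the context object: run every fetch plugin and hang its data off the context under the plugin's tag. The sketch below is a guess at the shape, not the real implementation; `build_context`, the stub fetchers, and the team names are all hypothetical.

```python
from types import SimpleNamespace

# Hypothetical sketch: build_context and the stub fetch plugins below are
# assumptions made for illustration, not the real scorecard code.


def build_context(service_name, fetch_plugins):
    """fetch_plugins: dict of tag -> callable(service_name).

    Each callable returns (state, message, data). The data of every
    successful fetch is attached to the context under its tag.
    """
    ctx = SimpleNamespace(service_name=service_name)
    for tag, plugin in fetch_plugins.items():
        try:
            state, message, data = plugin(service_name)
        except Exception as exc:
            # One unreachable data source should not stop the whole run.
            state, message, data = False, str(exc), {}
        setattr(ctx, tag, SimpleNamespace(**data) if state else SimpleNamespace())
    return ctx


def fake_fetch_ownership(service_name):
    # Stand-in for a real fetch plugin; returns canned data.
    return True, "gathered data", {"eng_team": "team-a", "sre_team": "sre-a"}


def broken_fetch(service_name):
    raise RuntimeError("source unavailable")


ctx = build_context(
    "example-service",
    {"ownership": fake_fetch_ownership, "builds": broken_fetch},
)
print(ctx.ownership.eng_team)
```

Fetching everything up front means each check only reads from the context and never talks to a remote system itself.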
The check plugin will look at our context object.
All plugins are small Python scripts, where small is 10 to 30 lines of code.
Simply return 2 or 3 things.
state*: True, False, None or 0.0 - 1.0
message*: short string
data: Python dict of interesting things
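Since a plugin may return two or three items, and state may be a bool, None, or a fraction, whatever runs the plugins has to normalise the result. A possible sketch follows; `normalize` is a hypothetical helper, and treating None as "check did not apply" is an assumption.

```python
# Hypothetical sketch: normalize() is not part of the real scorecard, and
# reading None as "not applicable" is an assumption.

def normalize(result):
    """Turn a plugin's 2- or 3-tuple into (score, message, data).

    score is a float in 0.0..1.0, or None when the state was None.
    data defaults to an empty dict when the plugin did not return one.
    """
    state, message, *rest = result
    data = rest[0] if rest else {}
    if state is None:
        score = None
    elif isinstance(state, bool):
        score = 1.0 if state else 0.0
    else:
        # Clamp partial-credit values into the documented 0.0 - 1.0 range.
        score = min(1.0, max(0.0, float(state)))
    return score, message, data


print(normalize((True, "ok")))
print(normalize((0.25, "partial", {"n": 1})))
print(normalize((None, "skipped")))
```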
Example fetch plugin
@ssc.tags("ownership")
def fetch_ownership(service_name):
    "Fetch all the ownership data of a service"
    o = r.get("http://owners/" + service_name)
    return True, "gathered data", o.json()
Example check plugin
@ssc.weight(5)
@ssc.tags('ownership')
@ssc.wiki('http://wiki/ssc_eng_owner')
def check_eng_team(ctx):
    "ensure ENG ownership of a service"
    if ctx.ownership.eng_team:
        return True, ctx.ownership.eng_team
    return False, "missing eng_team"
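The slides do not show how the `ssc` decorators work. One plausible reading is that they simply attach metadata to the function for the runner to use later. The sketch below assumes exactly that: the `weight`, `tags`, and `wiki` helpers, the `ssc_*` attribute names, and `run_check` are hypothetical stand-ins, and surfacing the wiki link on failure is a guess at its purpose.

```python
from types import SimpleNamespace

# Hypothetical sketch: these decorators only mimic the @ssc.* ones from the
# slides. The attribute names and run_check() are assumptions.


def _attach(name, value):
    """Build a decorator that stores value on the function as attribute name."""
    def decorator(func):
        setattr(func, name, value)
        return func
    return decorator


def weight(value):
    return _attach("ssc_weight", value)


def tags(*names):
    return _attach("ssc_tags", names)


def wiki(url):
    return _attach("ssc_wiki", url)


@weight(5)
@tags("ownership")
@wiki("http://wiki/ssc_eng_owner")
def check_eng_team(ctx):
    "ensure ENG ownership of a service"
    if ctx.ownership.eng_team:
        return True, ctx.ownership.eng_team
    return False, "missing eng_team"


def run_check(check, ctx):
    """Run one check and report its result; add the wiki link on failure."""
    state, message = check(ctx)[:2]
    report = {
        "check": check.__name__,
        "weight": check.ssc_weight,
        "state": state,
        "message": message,
    }
    if not state:
        report["how_to_fix"] = check.ssc_wiki
    return report


owned = SimpleNamespace(ownership=SimpleNamespace(eng_team="team-a"))
orphan = SimpleNamespace(ownership=SimpleNamespace(eng_team=None))
print(run_check(check_eng_team, owned))
print(run_check(check_eng_team, orphan))
```

Keeping the weight, tags, and documentation link next to the check itself means a new check is a single small file, with no separate registry to update.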
@ssc.weight(5) @ss