Voxxed Days Thessaloniki 2016 - Herding cats to a firefight


  • HERDING CATS TO A FIREFIGHT

    THE EVOLUTION OF AN ENGINEERING ON-CALL TEAM

    G. CHANG @GREYSCHALE

    EURUKO 2016

  • THE YEAR 1 B.C. (BEFORE CATS)

  • In the beginning,

    there was only

    darkness.

  • But suddenly,

    out of the darkness,

    there came a sound...

  • (pager noises)

  • One person was on-call. All day. And night. Every day. Every week. Forever.

  • (not really, but close enough)

  • Why not start having a rotation?

    "We don't need no stinkin' on-call rotation!"

  • Bullshit.

  • "Hi, sorry to be calling at this hour. I'm from Yammer, I work with _____. Can I please speak with him?"

    Date: Friday, xxth of XXX, 2013

    Time: 03:00 AM GMT -0800

  • THE YEAR 1 A.D. (AFTER DISASTER)

  • How to maths?!?!

    Given:

    Given:

    Given: (1 + 15 5 ) * 2

    ((4 / 1) * ((1 + 15 5) * 2)) = ???

    Answer:

  • How to acronyms?!?!

    MTBF MTTR AAR SLA

  • MTBF: Mean Time Between Failures

    MTTR: Mean Time To Recovery

    SLA: Service Level Agreement

    AAR: After Action Review

    IR: Incident Report

    OMGWTFBBQAFK
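
As a rough illustration of the first two acronyms, the sketch below computes MTBF and MTTR from a list of outages using the common textbook definitions (total uptime divided by number of failures, and total downtime divided by number of failures). The function name, the `(start, end)` tuple format, and the sample data are hypothetical and are not taken from the talk.

```python
from datetime import datetime, timedelta


def mtbf_and_mttr(incidents, window_start, window_end):
    """Return (MTBF, MTTR) as timedeltas.

    incidents: list of (start, end) datetimes, one pair per outage
               (hypothetical format, assumed not to overlap).
    window_start, window_end: the observation period.
    """
    if not incidents:
        raise ValueError("need at least one incident to compute a mean")
    # Total time spent recovering from failures.
    downtime = sum((end - start for start, end in incidents), timedelta())
    # Everything else in the window counts as healthy running time.
    uptime = (window_end - window_start) - downtime
    count = len(incidents)
    return uptime / count, downtime / count


# Made-up example: three outages of 1, 2 and 3 hours in a 30-day window.
window_start = datetime(2016, 1, 1)
window_end = window_start + timedelta(days=30)
incidents = [
    (datetime(2016, 1, 3, 3, 0), datetime(2016, 1, 3, 4, 0)),
    (datetime(2016, 1, 10, 12, 0), datetime(2016, 1, 10, 14, 0)),
    (datetime(2016, 1, 20, 0, 0), datetime(2016, 1, 20, 3, 0)),
]
mtbf, mttr = mtbf_and_mttr(incidents, window_start, window_end)
print("MTBF:", mtbf)
print("MTTR:", mttr)
```

Raising MTBF means fewer pages; lowering MTTR means each page is over sooner. The two numbers can be tracked side by side from the same incident log.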

  • MTBF: less frequent

    requires more stable systems

    engineers interrupted less often

    possibly more disastrous issues

    MTTR: faster recovery

    needs good response training

    engineers gain broad knowledge

    possibly more frequent issues

  • Google Docs Forms

    Yammer Notes

    JIRA

    (hard to read reports)

    (hard to analyse)

    (not perfect...but sort of works)

  • Hey, we're starting to get this!

  • Actually, not yet.

    System grows faster than we can learn about it

    Silos appear when you don't share knowledge

    Who's cleaning up this mess, anyway?

    Burnout is real

  • THE RENAISSANCE (GROWING PAINS)

  • Do more by doing less

    Split responsibilities by stack

    Added London office for follow-the-sun coverage

    Onboard everybody to the process

    Practice, practice, practice
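
Follow-the-sun coverage can be pictured as routing each page to whichever office is currently in its working day. The sketch below is only one way to express that idea, under loud assumptions: the two office names, the fixed UTC offsets (daylight saving is ignored), and the "closest to local early afternoon" rule are all hypothetical and are not described in the talk.

```python
from datetime import datetime, timedelta, timezone

# Assumed offices with fixed UTC offsets (daylight saving ignored for brevity).
OFFICES = {
    "San Francisco": timezone(timedelta(hours=-8)),
    "London": timezone.utc,
}
MIDDAY = 13  # hypothetical middle of the working day, as a local hour


def on_call_office(now_utc):
    """Pick the office whose local clock is closest to MIDDAY.

    now_utc must be a timezone-aware datetime.
    """
    def distance(tz):
        local = now_utc.astimezone(tz)
        hours = local.hour + local.minute / 60
        diff = abs(hours - MIDDAY)
        # Wrap around midnight so 23:00 and 01:00 count as two hours apart.
        return min(diff, 24 - diff)

    return min(OFFICES, key=lambda name: distance(OFFICES[name]))


# 11:00 UTC is 03:00 at GMT -0800, so the page goes to London instead.
print(on_call_office(datetime(2016, 1, 15, 11, 0, tzinfo=timezone.utc)))
```

With a rule like this, the 03:00 phone call from the earlier slide would have reached someone in the middle of their morning.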

  • All hands on deck

    Keep all alerts in a configuration repo

    Managers aren't doing anything, anyway -- make them Incident Managers!

    Runbooks, runbooks everywhere (and a unified one)

    Make the initial response as simple as possible

  • BACK TO THE FUTURE (THE PRESENT)

  • Combined schedules

    Fewer rotations

    Team is unified, so schedules should be too

  • Post-mortems and retrospectives

    What? Where? Who? Why? How?

    NO blame game

  • Weekly hand-overs and monthly reviews

    Previous week engineers to current week engineers

    Track top alerts and resolutions (or lack thereof)

    Focus on the noisiest services

    Timezones are hard
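
A weekly hand-over along these lines could start from a simple tally of the past week's alerts. The sketch below is a minimal example, assuming a hypothetical log of `(service, alert_name, resolved)` tuples; the service and alert names are made up, and a real paging tool would need its own export step.

```python
from collections import Counter


def noisiest_services(alerts, top=3):
    """Return the `top` services by alert count as (service, count) pairs."""
    counts = Counter(service for service, _name, _resolved in alerts)
    return counts.most_common(top)


def unresolved_alerts(alerts):
    """Return the distinct (service, alert_name) pairs still lacking a resolution."""
    return sorted({(service, name) for service, name, resolved in alerts if not resolved})


# Made-up week of alerts for illustration.
week = [
    ("search", "high latency", True),
    ("search", "high latency", False),
    ("search", "queue backlog", True),
    ("feeds", "error rate", False),
    ("feeds", "error rate", False),
    ("uploads", "disk full", True),
]
print(noisiest_services(week, top=2))
print(unresolved_alerts(week))
```

The first list points at where to focus; the second is what the outgoing engineers hand to the incoming ones.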

  • Bi-monthly surveys

    Summarise overall preparedness

    Make sure we're improving

    ...and that nobody is actually burned out

  • Fix ALL the alerts

    Noisy

    Flaky

    Real

  • WHERE ARE THE CATS NOW?!

  • The end game

    1 alert per person per day

    Service owners are on-call for those services

    The world is full of kittens!

  • Isn't on-call just for Ops?

    No

    Responsibility for our code

    Pride in our code

    No pain, no gain

  • After all...

    we are all cats being herded.

  • THANK YOU

    @GREYSCHALE