25
The Evolution of Site Reliability Engineering Ben Purgason Director, Site Reliability

The Evolution of Site Reliability Engineering · your side of the boat.’” –Fred Kofman •Attack the problem, not the person •No human gatekeepers. Build automated gatekeepers

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: The Evolution of Site Reliability Engineering · your side of the boat.’” –Fred Kofman •Attack the problem, not the person •No human gatekeepers. Build automated gatekeepers

The Evolution of Site Reliability Engineering

Ben PurgasonDirector, Site Reliability

Page 2: The Evolution of Site Reliability Engineering · your side of the boat.’” –Fred Kofman •Attack the problem, not the person •No human gatekeepers. Build automated gatekeepers

The Founding Principles

Site Up Empower Developer Ownership

Operations is an Engineering Problem

Page 3: The Evolution of Site Reliability Engineering · your side of the boat.’” –Fred Kofman •Attack the problem, not the person •No human gatekeepers. Build automated gatekeepers

• Incident management

• Purely reactive

• Keeps the company alive one more day

The Firefighter

Page 4: The Evolution of Site Reliability Engineering · your side of the boat.’” –Fred Kofman •Attack the problem, not the person •No human gatekeepers. Build automated gatekeepers

SRE - SWE Roles and Responsibilities

• Incident Management

• Automating manual fire suppression

• Seeking to understand the stack

• Monitoring

• Alerting

SRE (Firefighter) SWE• Feature / Product Development

• Escalation Point for SRE

Page 5: The Evolution of Site Reliability Engineering · your side of the boat.’” –Fred Kofman •Attack the problem, not the person •No human gatekeepers. Build automated gatekeepers

Tools SREThe Firefighters

• Something was always broken, literally

• GCN, post mortem, action Items, repeat

Page 6: The Evolution of Site Reliability Engineering · your side of the boat.’” –Fred Kofman •Attack the problem, not the person •No human gatekeepers. Build automated gatekeepers

If you can do justtwo things…

“If you’re going through hell, keep going.” – Winston Churchill

• Every day is Monday in Operations

• What gets measured gets fixed

Page 7: The Evolution of Site Reliability Engineering · your side of the boat.’” –Fred Kofman •Attack the problem, not the person •No human gatekeepers. Build automated gatekeepers

• Change control

• Reactive towards SWE plans

• Protect “our” site from “them”

The Gatekeeper

Page 8: The Evolution of Site Reliability Engineering · your side of the boat.’” –Fred Kofman •Attack the problem, not the person •No human gatekeepers. Build automated gatekeepers

SRE - SWE Roles and Responsibilities

• Incident Management

• Deployments

• Change Control

• Monitoring

• Alerting

SRE (Gatekeeper) SWE• Feature / Product Development

• Request Changes from SRE

• Escalation Point for SRE

Page 9: The Evolution of Site Reliability Engineering · your side of the boat.’” –Fred Kofman •Attack the problem, not the person •No human gatekeepers. Build automated gatekeepers

Tools SREThe Gatekeepers

• Human gatekeeping doesn’t scale

• Service Guard, dividing users since 2014

Page 10: The Evolution of Site Reliability Engineering · your side of the boat.’” –Fred Kofman •Attack the problem, not the person •No human gatekeepers. Build automated gatekeepers

If you can do justtwo things…

“There is no such thing as ‘the hole is in your side of the boat.’” – Fred Kofman

• Attack the problem, not the person

• No human gatekeepers. Build automated gatekeepers that use mutually agreed upon data.

Page 11: The Evolution of Site Reliability Engineering · your side of the boat.’” –Fred Kofman •Attack the problem, not the person •No human gatekeepers. Build automated gatekeepers

C e n t e r o f G r a v i t y -

The principle thing or activity that must be keptin balance or under control for an org to operate

Page 12: The Evolution of Site Reliability Engineering · your side of the boat.’” –Fred Kofman •Attack the problem, not the person •No human gatekeepers. Build automated gatekeepers

C e n t e r o f G r a v i t y -

“The ability to influence and beinfluenced by our partner teams”

Page 13: The Evolution of Site Reliability Engineering · your side of the boat.’” –Fred Kofman •Attack the problem, not the person •No human gatekeepers. Build automated gatekeepers

• Creating a site up culture

• Reactive towards SWE plans

• Rebuilds trusted relationshipsThe Advocate

Page 14: The Evolution of Site Reliability Engineering · your side of the boat.’” –Fred Kofman •Attack the problem, not the person •No human gatekeepers. Build automated gatekeepers

SRE - SWE Roles and Responsibilities

• Incident Management

• Monitoring and Alerting

• Partnering in the creation of “gate keeping data”

• Developing systems that empower ownership

• Relentless propagation of Site Up culture

SRE (Advocate) SWE• Feature / Product Development

• Escalation Point for SRE

• Monitoring and Alerting

• Partnering in the creation of “gate keeping data”

Page 15: The Evolution of Site Reliability Engineering · your side of the boat.’” –Fred Kofman •Attack the problem, not the person •No human gatekeepers. Build automated gatekeepers

Tools SREThe Advocates

• Site up helps everyone.

• Help us help you.

• How do you want to spend your time?

Page 16: The Evolution of Site Reliability Engineering · your side of the boat.’” –Fred Kofman •Attack the problem, not the person •No human gatekeepers. Build automated gatekeepers

If you can do justtwo things…

“Consistency over time equals trust.” – Jeff Weiner

• Be an advocate – make an advocate

• Do not insulate, share the pain

Page 17: The Evolution of Site Reliability Engineering · your side of the boat.’” –Fred Kofman •Attack the problem, not the person •No human gatekeepers. Build automated gatekeepers

• Empowers intelligent risk

• Proactive, joint planning with SWE

• Collaborating to magnify impactThe Partner

Page 18: The Evolution of Site Reliability Engineering · your side of the boat.’” –Fred Kofman •Attack the problem, not the person •No human gatekeepers. Build automated gatekeepers

SRE - SWE Roles and Responsibilities

• Incident Management

• Monitoring and Alerting

• Building products for reliability and scale

• Relentless propagation of Site Up culture

SRE (Partner) SWE

• Incident Management

• Monitoring and Alerting

• Building products for reliability and scale

• Feature / Product Development

Page 19: The Evolution of Site Reliability Engineering · your side of the boat.’” –Fred Kofman •Attack the problem, not the person •No human gatekeepers. Build automated gatekeepers

Tools SREThe Partners

• All teams plan together with partners

• Contributions to core libraries

• Contributions across org boundaries

Page 20: The Evolution of Site Reliability Engineering · your side of the boat.’” –Fred Kofman •Attack the problem, not the person •No human gatekeepers. Build automated gatekeepers

If you can do justthree things…

“We should operate on what needs to get done, not on an org structure!” – Dan Grillo

• Unified SRE and SWE planning

• Unified SRE and SWE priorities

• Contribute where it counts

Page 21: The Evolution of Site Reliability Engineering · your side of the boat.’” –Fred Kofman •Attack the problem, not the person •No human gatekeepers. Build automated gatekeepers

• Reliability throughout software lifecycle

• Proactive, one plan for SRE+SWE

• Everyone has the same job.

The Engineer

Page 22: The Evolution of Site Reliability Engineering · your side of the boat.’” –Fred Kofman •Attack the problem, not the person •No human gatekeepers. Build automated gatekeepers

SRE Evolution

The Partner

The Gatekeeper

The Advocate

The Engineer

The Firefighter

Page 23: The Evolution of Site Reliability Engineering · your side of the boat.’” –Fred Kofman •Attack the problem, not the person •No human gatekeepers. Build automated gatekeepers

Want to have a conversation?

Page 24: The Evolution of Site Reliability Engineering · your side of the boat.’” –Fred Kofman •Attack the problem, not the person •No human gatekeepers. Build automated gatekeepers

Thank you

Page 25: The Evolution of Site Reliability Engineering · your side of the boat.’” –Fred Kofman •Attack the problem, not the person •No human gatekeepers. Build automated gatekeepers

+