SPOF - Single "Person" of Failure

Single Point of Failure… Expert

Sasha Rosenbaum, @DivineOps

Who am I?

Sasha Rosenbaum

Azure & DevOps consultant

at 10th Magnitude for 4 years

Co-organizer of

- DevOps Days Chicago Conference

- Chicago Azure meetup

@DivineOps

What is a Single Point of Failure?

A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working

High Availability

Achieving redundancy by removing single points of failure

Having reliable cross-over capabilities to switch between components

Detection of failures as they occur, so that cross-over can be initiated
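
The three points above (redundancy, failure detection, cross-over) can be illustrated with a minimal sketch. The component names and the `is_healthy` callback are hypothetical and are not taken from the talk; a real system would probe an actual health endpoint.

```python
from typing import Callable, Optional, Sequence


def first_healthy(
    components: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first component that passes its health check, or None."""
    for name in components:
        # Failure detection: ask each redundant component whether it is up.
        if is_healthy(name):
            # Cross-over: route to the first component that responds.
            return name
    # Every redundant component failed, so the whole system is down.
    return None


# Hypothetical health states: the primary is down, two replicas are up.
status = {"primary": False, "replica-1": True, "replica-2": True}
active = first_healthy(["primary", "replica-1", "replica-2"], lambda n: status[n])
print(active)
```

With a single entry in `components`, the function can only return that component or `None`, which is what a single point of failure looks like in code.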

This is complicated

Architecting for HA

How is the entire system down?

We forgot a dependency!

Oh…

Just imagine buying a server that

- Has an uptime of roughly 16 hours a day

- With interruptions

- Is the only one of its kind

- Cannot be replicated!

Humans are NOT highly available

How did we get here?

Lack of budget

Lack of people

Human nature

How to recognize that you have a problem?

1

Keys to the Kingdom

TO MY PRODUCTION SERVER

Even when the systems are automated there are still humans who manage them

Why is there a single admin?

The situation evolved organically from having a small team

Someone took over deliberately

Role Based Access

Grant access based on a role/group

Admin group size > 1

Service accounts
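
One way to check the "admin group size > 1" rule is a small audit over role membership. This is only a sketch: the role names, user names, and the `svc-` service account are invented for illustration, and in practice the membership data would come from your identity provider.

```python
from typing import Dict, List, Set


def single_person_roles(
    role_members: Dict[str, Set[str]],
    service_accounts: Set[str],
    minimum: int = 2,
) -> List[str]:
    """Return roles held by fewer than `minimum` human members."""
    flagged = []
    for role, members in sorted(role_members.items()):
        # Service accounts do not count: they cannot answer a page.
        humans = members - service_accounts
        if len(humans) < minimum:
            flagged.append(role)
    return flagged


# Hypothetical role assignments.
roles = {
    "prod-admin": {"alice", "svc-deploy"},
    "db-admin": {"alice", "bob"},
    "on-call": {"bob", "carol"},
}
print(single_person_roles(roles, service_accounts={"svc-deploy"}))
```

Here `prod-admin` is flagged because only one human holds it once the service account is excluded.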

Make sure that the person on call has the necessary access to fix the problem
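
A similar check can be run against the on-call rotation. The sketch below assumes a hypothetical list of roles needed to fix production problems and a hypothetical membership mapping; neither comes from the talk.

```python
from typing import Dict, Iterable, List, Set


def missing_access(
    on_call: str,
    required_roles: Iterable[str],
    role_members: Dict[str, Set[str]],
) -> List[str]:
    """Return the required roles that the on-call person does not hold."""
    return sorted(
        role
        for role in required_roles
        if on_call not in role_members.get(role, set())
    )


# Hypothetical role assignments and incident requirements.
membership = {
    "prod-admin": {"alice", "bob"},
    "db-admin": {"alice"},
}
print(missing_access("bob", ["prod-admin", "db-admin"], membership))
```

An empty result means the person on call can act without waking someone else up; anything else is a gap to close before the next rotation.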

TRUST YOUR PEOPLE!!!

2

Beware of the Expert!

“This will take 15 minutes to fix and 8 hours to explain”

We cannot afford the loss of productivity!

Can you afford losing this knowledge?

Delegate to Juniors

Juniors are wonderful people

They ask tough questions

Your new hires haven’t yet caught the “This is how it’s always been” virus

You are emotionally invested in your code

It is hard not to get protective of it

Documentation

Documents

Readme

Comments

Tests

Automation

Features

3

“I cannot afford to take vacation!”

Job security?

Productivity?

Hours / Productivity

Research shows that working longer hours DOES NOT increase productivity

You need rest to be at your best!

Cell phones are the single worst thing that happened to people AND businesses in the last century

If people were actually unreachable, we would find a more reliable way to solve problems

Mandatory Vacation

Game Days

Say NO to having a Single PERSON of Failure ;-)

Great job, DoD Silicon Valley!
