
How it All Goes Down


"How it All Goes Down" presented at CTO Summit, NYC, November 3rd, 2014.


Page 1: How it All Goes Down

How It All Goes Down

Your next service outage will be during @ctosummit at 1:50pm.

Daniel Doubrovkine [email protected]

@dblockdotorg

Page 2: How it All Goes Down
Page 3: How it All Goes Down

Do you have a Syrian domain?

Somewhere in here is your .sy TLD authority.

Page 4: How it All Goes Down

Are you using MongoDB?

Page 5: How it All Goes Down

Do you use a PAAS?

Page 6: How it All Goes Down

Is there any logic involved?

1. Restart Unicorns, no improvement.
2. Restart a server, no improvement.
3. Found an EC2 maintenance note buried in email.
4. Patch responsible for 5x slower response times!
5. Do a DB query from a production console.
6. Identify a pattern (slow, then fast).
7. Look in MongoDB logs.
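Steps 5 and 6 above (run a query from a console, then look for a slow-then-fast pattern) can be sketched as a small timing loop. This is a minimal sketch in Python, not what the team actually ran: `time_calls`, `find_slow_then_fast`, and the `run_query` callable are hypothetical names, and the 5x threshold is borrowed from the slowdown figure above purely as an illustrative default.

```python
import time
from typing import Callable, List


def time_calls(run_query: Callable[[], object], repeats: int = 10) -> List[float]:
    """Run the same query several times and record each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # hypothetical: any zero-argument callable that hits the database
        durations.append(time.perf_counter() - start)
    return durations


def find_slow_then_fast(durations: List[float], ratio: float = 5.0) -> List[int]:
    """Return indexes where a call took at least `ratio` times as long as the next one."""
    return [
        i
        for i in range(len(durations) - 1)
        if durations[i + 1] > 0 and durations[i] >= ratio * durations[i + 1]
    ]
```

Feeding the output of `time_calls` into `find_slow_then_fast` gives the positions where a slow call was immediately followed by a much faster one, which is the kind of pattern that points away from the application code and toward the connection or the database.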

Page 7: How it All Goes Down

Are there humans involved?

Page 8: How it All Goes Down

Are you using a beta version of a driver?

Page 9: How it All Goes Down

Post-Mortem

Page 10: How it All Goes Down

Subject

Page 11: How it All Goes Down

Summary

We have experienced 3 separate outages over the last 72 hours that have affected api.artsy.net, which is the backbone of all our applications, including Admin, CMS, artsy.net, m.artsy.net, etc.

The first incident was a period of slower response times, 9/28 2AM-10AM EST.

The second and third incidents were two outages, 9/29 11PM-12AM EST and 9/30 4AM-6:30AM EST.
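The three windows above can be added up to see how much time was affected. A quick sketch of that arithmetic in Python, assuming the year is 2014 (the slides give only month, day, and time) and reading "11PM-12AM" as ending at midnight, the start of 9/30; the `incidents` list is simply the three windows transcribed by hand.

```python
from datetime import datetime, timedelta

# Assumption: year 2014; the slides state only month/day and times in EST.
incidents = [
    ("slow responses", datetime(2014, 9, 28, 2, 0), datetime(2014, 9, 28, 10, 0)),
    ("outage", datetime(2014, 9, 29, 23, 0), datetime(2014, 9, 30, 0, 0)),
    ("outage", datetime(2014, 9, 30, 4, 0), datetime(2014, 9, 30, 6, 30)),
]


def hours(start: datetime, end: datetime) -> float:
    """Length of a window in hours."""
    return (end - start) / timedelta(hours=1)


durations = [hours(start, end) for _, start, end in incidents]
total_hours = sum(durations)
outage_hours = sum(hours(s, e) for kind, s, e in incidents if kind == "outage")

print(durations)     # [8.0, 1.0, 2.5]
print(total_hours)   # 11.5
print(outage_hours)  # 3.5
```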

Page 12: How it All Goes Down

Cause

1. Trigger. Servers rebooted.

2. Unexpected behavior. Reboots should have been handled just fine by software.

3. Cause. Human error + bug in the driver.

4. Consequence. All front-ends down.
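Item 2 expects a reboot to be absorbed by software. One common way a client does that is to retry a failed call with a growing delay until the server is back. The sketch below is a generic Python illustration of that idea, not the behavior of the MongoDB driver involved here; `with_retries`, its parameters, and the choice of `ConnectionError` as the retryable failure are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter exists only so the delay can be swapped out in tests. A wrapper like this helps with a server that comes back after a reboot; it does nothing for a client that never notices the connection is stale, which is closer to what a driver bug would look like.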

Page 13: How it All Goes Down

Resolution

1. Ticket. Woke up a contractor.

2. Manual intervention. Kick servers.

Page 14: How it All Goes Down

Post-Mortem

1. Human Error Preventable. The dismissal of the alert was wrong.

2. Failure to Plan. Reboot schedule needed a human to monitor it.

Page 15: How it All Goes Down

Outage History

Page 16: How it All Goes Down

Thanks!

Daniel Doubrovkine [email protected]

@dblockdotorg