dan-cundiff
Things to put in a wiki page:
● How to get access
● Profile setup
● GitHub pages, emoji, issues
● How to integrate with common tools
● Link to post-mortems, how to reach admins
● Link to GitHub docs for the rest
Emergency: “We understand emergencies happen from time to time, so we've made available an email address you can write to, which will page us: ***. Literally, if you email this address, it will interrupt our dinner with our families or wake us up at 2:00am with SMS, push notices, ringing phones, beeps and all kinds of things which immediately grab our attention.”
Hey GHE users - here’s the post mortem as promised. It’s with a tear we write this as we almost reached a full year of zero unplanned downtime
resulting from a GHE defect or an action we took. But it was an action we took a few weeks ago that brought it down last night.
● On 2015-04-30, we upgraded to GHE v2.2.0. It was the first time GitHub strongly recommended taking an ESXi VM snapshot before
beginning the upgrade (we normally don’t because our rebuild procedure is well practiced.) We followed the recommendation.
● On 2015-05-21 at 9:35 PM CDT a Runscope synthetic transaction failed which triggered PagerDuty to call both GHE admins.
● The git1 VM was hard down, but git2 was up, so we knew it wasn’t a network issue. We didn’t want to fail over given the state of the VM (if we did, we might just cause the same issue on the hot standby).
● By 10:15 PM we determined the culprit was a snapshot that had exceeded the available disk space on the same volume where GHE runs.
● We deleted the snapshot; the subsequent consolidation process took about an hour.
● At 11:46 PM we brought GHE back online and it was working fine.
● We verified HA replication was functioning correctly AND that backups were running as normal. No data was lost (the thing we care about
most).
From now on, we’ll place the snapshot on a separate volume and/or remove the snapshot as soon as we determine it’s no longer needed.
While what happened is an embarrassing n00b mistake, we think it’s still important to talk about it and learn from it.
As always, let us know if you have any feedback for us. We always want to make this thing better.
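The follow-up action in the post-mortem (remove a snapshot once it is no longer needed, and keep an eye on the volume it lives on) could be sketched as a small periodic check. This is only an illustration under stated assumptions: the `Snapshot` record, the function names, the 3-day age limit and the 20% free-space floor are all hypothetical, and the snapshot list and volume sizes would have to come from whatever inventory your hypervisor tooling provides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Snapshot:
    # Hypothetical record; fill it from your own VM inventory.
    name: str
    created: datetime
    size_bytes: int


def snapshots_to_review(snapshots, now, max_age=timedelta(days=3)):
    """Return the names of snapshots older than max_age (assumed limit)."""
    return [s.name for s in snapshots if now - s.created > max_age]


def volume_at_risk(free_bytes, total_bytes, min_free_ratio=0.2):
    """True when the free fraction of the volume is below the assumed floor."""
    return free_bytes / total_bytes < min_free_ratio


# Example using the dates from the post-mortem: a snapshot taken at the
# 2015-04-30 upgrade is still present on the day of the outage.
snaps = [Snapshot("pre-upgrade-2.2.0", datetime(2015, 4, 30), 40 * 1024**3)]
stale = snapshots_to_review(snaps, now=datetime(2015, 5, 21))
print(stale)
```

Run from a scheduler, a check like this would have flagged the forgotten snapshot well before it filled the volume; the thresholds are placeholders to tune.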
“You guys rock. I can only imagine a world where every system we all use day-to-day had this much visibility about mistakes and, more importantly, how they will prevent them going forward. I’m going to challenge myself to start doing retrospectives when my stuff fails.”
“...if you have to have a n00b mistake, the night before a 3 day weekend is the BEST time! On a serious note, keep up the great work! LOVE my git!”
“The honesty is absolutely wonderful, thank you for not withholding bad news. Y’all look more professional as a result.”