Disaster Management at the Tier-1, Andrew Sansum, 2nd April 2009, RAL

Do You Recognise This?


Burnt out UPS battery at ASGC

Clearly a Disaster


Do You Recognise This?


Challenger Disaster


Cause of Challenger Disaster

• It was the “O” rings, wasn’t it?
  “[The Rogers commission] found that the Challenger accident was caused by a failure in the O-rings … The failure of the O-rings was attributed to a design flaw, as their performance could be too easily compromised by factors including the low temperature on the day of launch”

• Yes, but there were underlying cause(s):
  – Communication problems: “..failures in communication... resulted in a decision to launch 51-L based on incomplete and sometimes misleading information, a conflict between engineering data and management judgments, and a NASA management structure that permitted internal flight safety problems to bypass key Shuttle managers.”
  – Management errors: “The Commission found that as early as 1977, NASA managers had not only known about the flawed O-ring, but that it had the potential for catastrophe.”


Why considered a disaster?

• People died: “Challenger disintegrated about seventy-three seconds after launch, killing the seven astronauts aboard”

• NASA’s reputation was badly damaged: “It also represented a serious blow to NASA's reputation, colouring the public perception of piloted spaceflight ..”

• Financial losses and reduced funding opportunity: “…and affecting the agency's ability to gain continued funding from Congress.”

• Couldn’t meet operational commitments: “Following the Challenger disaster, NASA grounded the remainder of the shuttle fleet while the risks were assessed more thoroughly, design flaws were identified, and modifications were developed and implemented.”


Identify Potential Disasters

• We do not (usually) mean the same thing when we say disaster as is meant by the “Challenger Disaster”

• Nevertheless there are many outcomes we wish to avoid

• The Tier-1 Disaster Management plan seeks to identify circumstances that have the potential to significantly impact:
  – Safety
  – Service commitments
  – Reputation
  – Finances

Some Disasters

• Can construct a list of obvious disasters, e.g.:
  – Fire/flood etc.
  – Loss of network
  – Security incident
  – We did this in the form of a risk analysis: DPv0.8.mht

• Also have previous experience:
  – CASTOR 2.1.7 upgrade
  – Disk firmware problems made it impossible to run delivered H/W
  – R89 delays (unable to manage deliveries)
  – Backplane burnout (not a disaster but very close)

• Common themes:
  – The ones we generated tended to be operational and start suddenly
  – The ones we suffered were slow-moving project management failures

• Also need to be able to manage disasters we have not thought of

Evolution of a Disaster

Sometimes fast

Sometimes slow but similar result

A Strategy

• Create a Disaster Management System which handles all potential disasters in a similar way.

• Identify common features and trigger levels to allow us to spot events before they blossom into disaster

• Mess with existing processes as little as possible

• Build specific contingency plans which add to the general response in specific circumstances.

• Trigger early, trigger often, respond ahead of the curve
  – Make use of the system routinely
  – Stops the system decaying
  – Gives operational and project management benefits

Don’t Confuse Disaster with Routine OPS


Loss of power is not a disaster … but failure of the routine restart may lead to disaster

Routine Operations

• We already have:
  – Production Team (Gareth, John Kelly and Tiju)
  – Admin on Duty (daytime)
  – On-call (nighttime)

• Routine operations should be:
  – Looking for problems
  – Fixing things
  – Calling experts
  – Notifying users
  – Setting downtimes
  – Assessing seriousness
  – Reviewing events
  – Improving future response

• Not part of the Disaster Management System
  – But prevents many things moving into the system

Need Escalating Response

• Start lightweight (Stage 1: Disaster Potential)
  – Informally assess/triage
  – Monitor/compare against standard contingencies
  – Set deadlines
  – Watch for things leaving the expected script but avoid interfering

• Add some internal management (Stage 2: Disaster Possible)
  – Add internal (group) oversight
  – Formally assess
  – Interfere more, divert resources

• Escalate response to imminent disaster (Stage 3: Disaster Likely)
  – Broaden oversight and expertise (include GRIDPP + department)
  – Regular meetings with experiments
  – Prepare contingencies

• Manage actual disaster (Stage 4: Disaster)
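The escalating response can be read as a small state machine. The following is a minimal sketch, not the Tier-1's actual tooling: the names `Stage` and `next_stage` are hypothetical, the `ROUTINE` level stands for routine operations outside the system, the numbering of the two middle stages as 2 and 3 is inferred from their order, and the rule that an incident moves up one stage at a time is an assumption.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Escalation stages; ROUTINE (0) is an assumed label for normal operations."""
    ROUTINE = 0
    DISASTER_POTENTIAL = 1  # lightweight: informal triage, set deadlines
    DISASTER_POSSIBLE = 2   # internal (group) oversight, formal assessment
    DISASTER_LIKELY = 3     # broader oversight, prepare contingencies
    DISASTER = 4            # manage the actual disaster


def next_stage(current: Stage, criteria_met: bool) -> Stage:
    """Move up one stage when the escalation criteria are met (assumed rule)."""
    if criteria_met and current < Stage.DISASTER:
        return Stage(current + 1)
    return current
```

For example, `next_stage(Stage.DISASTER_POTENTIAL, True)` gives `Stage.DISASTER_POSSIBLE`, while an incident whose criteria are not met stays where it is.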

At each stage

• Formal list of pre-defined communications
  – Notify team of deadline to escalation
  – Notify PMB incident is moving onto disaster track
  – Notify esc senior staff
  – Advise Press & PR (as disaster approaches)
  – ….

• Formal list of actions that should be carried out, e.g.:
  – Define roles
  – Hold Incident Review Meeting
  – Start process to obtain financial approval
  – Arrange exceptional experiment liaison meeting
  – Review policy documents
  – ….

• Formal list of criteria that get you to next stage

Contingency Plans

• Contingency plans supplement general disaster management system.

• For each stage in the general system, supplement with:
  – Criteria to reach (or avoid reaching) this stage
  – Actions to take at this stage
  – Communications to make at this stage

• Example Contingency Plan: Contingency_Plan_Major_Security_Incident.mht
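One way to picture how a contingency plan supplements the general system is as per-stage lists that are appended to the general ones. This is a rough sketch under stated assumptions: `StagePlan` and `supplement` are hypothetical names, and the security-incident entries are invented for illustration and are not taken from Contingency_Plan_Major_Security_Incident.mht.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StagePlan:
    """What one stage prescribes: communications, actions, escalation criteria."""
    communications: List[str] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)
    escalation_criteria: List[str] = field(default_factory=list)


def supplement(general: StagePlan, contingency: StagePlan) -> StagePlan:
    """Return a new plan: general items first, contingency items appended."""
    return StagePlan(
        communications=general.communications + contingency.communications,
        actions=general.actions + contingency.actions,
        escalation_criteria=general.escalation_criteria
        + contingency.escalation_criteria,
    )


# General items taken from the "At each stage" slide.
general_stage = StagePlan(
    communications=["Notify team of deadline to escalation"],
    actions=["Define roles", "Hold Incident Review Meeting"],
)

# Hypothetical supplement for a security incident (illustrative entries only).
security_extra = StagePlan(
    actions=["Isolate affected hosts"],
    escalation_criteria=["Compromise confirmed on a production service"],
)

combined = supplement(general_stage, security_extra)
```

Building a new `StagePlan` rather than editing the general one keeps the general response unchanged, so the same general stage can be combined with several contingency plans.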


Conclusions

• Disaster Management System is working. Already managed:
  – Site DNS failure (reached Stage 1)
  – Power failure (reached Stage 2)

• Doesn’t replace our existing processes
  – But does make sure they are responding correctly

• Expect it to manage equally well:
  – Operations failures (network down and out)
  – Project management failures (building delivered late)
  – Unexpected problems (e.g. man from Mars at door)

• Working well and giving immediate benefit

• Doesn’t remove the need to plan for the aftermath of a building fire (but will help manage the situation)
  – Still working on contingency planning and experiment requirements

Documents
