AB-CO-Technical Committee, 21st September 2006
Functionality of the Control System after infrastructure failures
Alastair Bland (AB/CO/IN) with help from Enzo Genuardi and Jean Juillard
http://ab-co-tech-committee.web.cern.ch/ab-co-tech-committee/
Introduction
Intended audience
Who to call when there are problems
Building 874 infrastructure (access, powering, cooling)
Building 874 computers
First steps to take when there is an infrastructure failure
What the TI operators need to work
Power off (voluntary or forced) then power on in the CCC and CCR
Recommendations for the future
Annex: Simplified Timeline of 29/7/2006 power cut
Intended Audience
Anyone who might have to deal with a major power, cooling or network problem affecting the AB Control System and in particular the Control Room environment
– Technical Infrastructure (TI) operators
– Accelerator operators dealing with Access, Radiation, etc. (getting beams back is not covered: this requires Timing Experts and a lot of Front End rebooting!)
– AB/CO Exploitation Team
– AB/CO Specialists
– AB/CO Supervisors
Emphasis is on Building 874, Prevessin
Who to call when there are problems
When there is a serious power or network problem, the first difficulty in the modern paperless world is knowing who to call and what their number is!
– Windows: the phone book, if already used by the user on that computer, should be cached on the hard disk
– Linux: in an xterm type: /usr/bin/phone SURNAME
This will not work if IT Building 513 is powered down or not available due to network problems. Starting an xterm is difficult when the AB/CO NFS file servers ABSRV1 or CS-CCR-FEOP are not available.
– Legacy HPUX: in an xterm type: cd /user/pcrops/production/phonex then ./xpb
This program has certain advantages over all the others but the database may not be up to date.
Do not forget that your own portable phone usually has a good list of colleagues. If Sunrise GSM does not work, switch to another operator if your subscription allows this.
Useful phone numbers
NAME | Group | Speciality | Telephone GSM Home
SCHMICKLER Hermann | AB/CO | Group Leader | 77078 74788 164004
SICARD Claude-Henri | AB/CO | Exploitation Manager | 73071
CHARRUE Pierre | AB/CO/IN | Section Leader | 75410 163230
BLAND Alastair | AB/CO/IN | Windows/Linux Software | 75568 163727
GENUARDI Enzo | AB/CO/IN | Linux/HPUX Hardware | 75537 163395
BAKKER Dirk | AB/CO/IN | Video distribution | 75575 163235
BALLET-THOUBLE Rene | AB/CO/IN | Control Room hardware | 75637 75144 163231
ELYN Jean-Michel | AB/CO/IN | Linux software | 78754 163591
GLAFIROV Vladimir | AB/CO/IN | Windows/Linux Software | 79369
SIGERUD Katarina | AB/CO/AP | Laser Alarm System | 71464 79898 164648
STAPLEY Niall | AB/CO/AP | Laser Alarm System | 75834 79898 160918
DE METZ-NOBLAT Nicolas | AB/CO/FE | Front Ends, Linux | 73487 163070
SURBACK Guy | AB/CO/FE | Front Ends, remote reboot | 72718
The CO Exploitation team could also be called, especially if the PS is affected.
AB/CO Organigram of September 2006
Building 874 plan (ground floor only)
Warning: this plan is old and does not show the corridor correctly!
How to get into Building 874
Normally you will need access privileges CCCPRIV and CCRPRIV
– Ask for this via EDH
– You need the special cards with the RFID chip. It is worth testing that you can get in by waving your card in front of the CCC and CCR readers
During a power failure the access system to the CCC and CCR may fail
The glass doors to the CCC normally default to open without power
– CO have a key to the CCC external doors. As there should always be operators in the CCC this should not be needed
Apparently the access control for the outside doors fails to the locked state
– There is another way into Building 874
Apparently the access control for the CCR fails to locked as well
– The operators’ key can open the CCR
– I recommend blocking the CCR door open with a chair
CERN Control Centre (CCC)
Warning: this simulation is old, in particular the TI and Cryo desks are joined!
Powering and Cooling in the CCC
The CCC Cooling is powered by normal electricity
– The CCC does not overheat rapidly, so this is not a problem
There are two sources of normal power: EBD5 and EBD6
– These feed the ceiling lights and the sixteen 46 inch screens on the wall
– After a power cut use the remote control to waken the wall screens
The UPS power comes from EOD3
– Like EOD2 in the CCR this runs until it drops (no set time limit). On 29/7/2006 it ran for around an hour but was within a few tens of minutes of dropping
– All the Consoles should be on EOD3 and almost all the screens are too
The lights on the tables are on EOD3; they are the emergency lighting
The IP telephones are powered from the network starpoint
LHC island
Computer Control Room (CCR) and Network Starpoint
[Figure: CCR and Network Starpoint rack layout with logical UPS powering arrangement. Alastair Bland (AB/CO/IN), 19/09/2006, based on the CCR layout diagram of Rene Ballet-Thouble and Claes Frisk.]
Areas shown: the CCR racks (Intercom, Access, CATV machines, Teletext pages, "ASOS SPS analog, sign.", TS/CV, RF - LHC, Central Timing, PCs from PS, HPUX servers and Linux PCs, wireless LAN, CCC/CCR switches and routers); the IT/CS Network Starpoint (locked - TI Operators have key); the Maintenance Lab (Dirk and Rene).
Labelled items include: power sources EOD1, EOD2, EOD9, EBD1 and EBD4; servers absrv1, tcrsrv1, elsrv1, stsrv1, samoa, cs-ccr-feop, laser1 and laser2 + tcrpl*; the WinXP and Linux consoles; the CCRPRIV card reader; the "Thermometer zone" by the HP ProLiants; the rechargeable lamp; Air Con. and "Air Con. Vent Out"; Sunrise GSM.
Rack legend: tiles with holes, network, green rack, blue rack, HP rack, console table, 6 x network outlets, 4 x network outlets.
CCR computer area
[Figure: CCR computer area, showing the CCR racks and the Maintenance Lab (Dirk and Rene): Intercom, Access, CATV machines, Teletext pages, "ASOS SPS analog, sign.", RF - LHC, Central Timing, PCs from PS, HPUX servers and Linux PCs, wireless LAN; power sources EOD1, EOD2 and EBD4; servers absrv1, tcrsrv1, elsrv1, stsrv1, samoa, cs-ccr-feop, laser1 and laser2 + tcrpl*; the WinXP and Linux consoles; the "Thermometer zone" by the HP ProLiants; the rechargeable lamp; "Air Con. Vent Out".]
Powering and cooling in the CCR (general)
The cooling is powered by normal electricity.
– It sucks air out of the floor (“Air Con. Vent out” on the plan)
Once we lose normal power the CCR heats up very quickly
– Normal temperature for rack of LASER2 is 24 degrees centigrade
– It was 33 degrees centigrade at 12:15 on 29/7/2006
– You must block open the door to the corridor, open outside doors, open starpoint door (key from TI operator).
– If more than 35 degrees: open outside door of starpoint and door/windows of maintenance lab. If still too hot you must start switching off equipment
The ceiling lights are on normal power
– There is a rechargeable lamp in the Maintenance lab. Use it as you do not want to fall through the false floor if it is open!
Power is distributed to racks via “normabarres”. The source is clearly labeled.
Please check if any of the “multiprises” have tripped due to overload or the 30mA to earth detection
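The temperature guidance on this slide can be summarised as a small decision function. This is only an illustrative sketch in Python: the function name `ccr_cooling_actions` and the `still_too_hot` flag are hypothetical, and the only threshold used is the 35 degrees quoted above.

```python
def ccr_cooling_actions(temp_c, still_too_hot=False):
    """Return the ventilation steps suggested for a CCR rack temperature.

    Hypothetical helper: the first three steps apply as soon as normal
    power (and so the cooling) is lost; the extra steps apply above 35 C.
    """
    actions = [
        "block open the door to the corridor",
        "open the outside doors",
        "open the starpoint door (key from TI operator)",
    ]
    if temp_c > 35:
        actions.append("open the outside door of the starpoint")
        actions.append("open the door and windows of the maintenance lab")
        if still_too_hot:
            # Last resort once every opening has been tried.
            actions.append("start switching off equipment")
    return actions


if __name__ == "__main__":
    # 33 degrees C was the reading at 12:15 on 29/7/2006.
    for step in ccr_cooling_actions(33):
        print(step)
```

Calling `ccr_cooling_actions(36, still_too_hot=True)` adds the two extra openings and then the instruction to start switching off equipment.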
Powering and cooling in the CCR (computers)
EOD1 and EOD2 give UPS power. There are no batteries in the CCR: EOD1 and EOD2 are fed from the old TCR (Building 212) diesels, from normal power, or from batteries in SE0 (Building 924, Prevessin).
The HP type racks are powered from EOD1 and EOD2:
– EOD1 is cut after 10 minutes (this occurred on 29/7/2006)
– EOD2 runs until it drops (it ran without cutting on 29/7/2006)
– HP Proliants, HP network switches in the “starpoint deporté” and disks of ABSRV1 and TCRSRV1 are really dual powered
– Keyboard, Video and Mouse (KVM) switches and the CRT or TFT screens for HP Proliants are powered by EOD1 or EOD2
– All HPUX systems including the two boxes forming ABSRV1 and TCRSRV1 are powered by EOD1 or EOD2
All other racks are powered by EOD1 only
Check voltage and current here
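The reason dual powering matters can be shown with a very small model. The sketch below is hypothetical (the function `still_powered` and the device labels are invented for illustration): a device stays up only if at least one of its inputs is on a source that is still live. It also shows why a machine with both inputs cabled to EOD1 gains nothing, which is what happened to six HP Proliants on 29/7/2006.

```python
def still_powered(feeds, lost_sources):
    """True if at least one of the device's power inputs is still live."""
    lost = set(lost_sources)
    return any(feed not in lost for feed in feeds)


# Hypothetical cabling examples (device labels are illustrative only).
CABLING = {
    "dual-powered Proliant": ("EOD1", "EOD2"),
    "mis-cabled Proliant": ("EOD1", "EOD1"),  # both inputs on EOD1
    "other rack": ("EOD1",),
}

if __name__ == "__main__":
    for name, feeds in CABLING.items():
        state = "up" if still_powered(feeds, {"EOD1"}) else "down"
        print(f"{name}: {state} after loss of EOD1")
```

In this model only the correctly dual-powered machine survives the loss of EOD1, and nothing survives the loss of both EOD1 and EOD2.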
IT/CS Network Starpoint
[Figure: IT/CS Network Starpoint (locked - TI Operators have key), showing the TN router and terminal, "GPN router+tele", patch panels, TN switches, fiber switches, GPN switches, IT/CS servers, TN LHC / SPS, "GPN Prev. Router", EOD9, Air Con. and "Air Con. Vent Out", Sunrise GSM, and the neighbouring TS/CV and Access racks and CCC/CCR switches and routers.]
IT Network Starpoint cooling and powering
The network starpoint is cooled by an air conditioner in the room. It used to leak water but was fixed after 29/7/2006.
The starpoint has two power sources:
– EOD2 from the CCR
– EOD9 in the starpoint. EOD9 contains batteries itself (less than 10 minutes available?). It is fed from normal electricity at the moment.
The Technet and General Purpose Network routers are dual powered.
The IT/CS Spectrum system, IP-DNS-4 and IP-TIME-4 are dual powered.
The HP Procurve Gigabit switches have only one 230 V input. However, they are often (but not always) associated with an HP Redundant Power Supply (RPS), which supplies low voltage power if they lose 230 V or the internal power supply fails.
– Beware: we have noticed a tendency for these RPS units to trip their own circuit breakers when there is power loss to the switches.
First steps to take when there is an infrastructure failure
The TI operators should call us if:
– The air conditioning in the CCR is not running
– We have lost any of the UPS sources (EOD1, 2, 3 or 9)
When you arrive, prepare pen and paper and note down:
– The time (from the CCC Rolexes!)
– The situation now
– Any interventions you perform or people called
• If you have a camera or camera phone take a picture of a Rolex (to synch real time with camera time) then take pictures, preferably without flash, of the racks and equipment before and after you flick a switch back on.
– Save useful logfiles such as /var/log/messages before they are overwritten
– Your leaving time
All this is vital because there will be “a fact finding mission” or “investigation”
Feel free to call your supervisor (you can delegate decisions and responsibility to him or her!)
You cannot fix the whole Control System on your own: get one of your technical colleagues called in too.
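Saving logfiles before they are overwritten can be as simple as copying them aside under a timestamped name. The following Python sketch is one possible way to do it; the function name `preserve_logfile` and the destination directory are hypothetical, and reading /var/log/messages normally requires suitable privileges.

```python
import shutil
from datetime import datetime
from pathlib import Path


def preserve_logfile(src, dest_dir, now=None):
    """Copy a logfile to dest_dir under a timestamped name and return the copy's path."""
    now = now or datetime.now()
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Example result: messages.20060729-121500
    dest = dest_dir / f"{src.name}.{now:%Y%m%d-%H%M%S}"
    shutil.copy2(src, dest)  # copy2 also keeps the modification time
    return dest


if __name__ == "__main__":
    # Hypothetical destination directory; adjust to a disk that is still available.
    print(preserve_logfile("/var/log/messages", "/tmp/incident-logs"))
```

The timestamp in the file name gives the investigation a second record of when the copy was taken, alongside the pen and paper notes.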
What the TI operators need to work
Their Windows Consoles
The Phone book
The Electrical Network Supervisor
– This is in the old TCR, building 212, behind a firewall
– Started from “Startup” of TIOP login which starts Exceed to our HPUX system ELSRV1. From there the old X-Motif Console Manager is used to actually start the ENS programs.
The LASER Alarm System
– Started from Java Console Manager (probably needs ABSRV1, CS-CCR-WWW1)
– Needs HP Proliants LASER2 (oc4j + SonicMQ), SLJAS2 and SLJAS3 (SonicMQ). The HPUX system SLJAS1 (SonicMQ) is also probably needed. If LASER1/2 restart then TCRSC01 or 02 Oracle Databases in Building 513 must be running
The TIM system
– Runs on HP Proliants called TCRPL*. Needs TCRSC01 or 02 for login.
XCLUC (for monitoring the Proliants, HPUX systems and Front Ends)
Scenario
Like 29/7/2006
– We lose normal power
– The diesels start but run only a few minutes
– 10 minutes later we lose EOD1 and EOD9
– You arrive
– Your aim is to make the UPS last as long as possible
Unlike 29/7/2006
– We then lose EOD2
– You have to restart all the HP Proliants, the HPUX systems, etc.
Power off in the CCC
Organize powering off of all unused Consoles and Screens in CCC. This is done by briefly pushing the HP DC7600 power switch (clean shutdown for Windows, dirty for Linux)
Do not turn off:
– TI consoles and CSAM (CWO-CCC-C0WF, C1WF, C2WC, C2WF, C0WA, C1WA and STCSAM-TCR2)
– 2 Cryo Consoles (CWO-CCC-C4WC and C8WC)
– 2 PS Access Consoles (CWO-CCC-B9WC and B9WF) and Radiation Console (CWO-CCC-B9LC)
– 4 SPS and North Areas Consoles (CWO-CCC-A8WC, A9WC, A8LF and A9LF)
– If the LHC Access system has been installed leave it on
– One Linux and one Windows system in each Island for general use
Turn off most of the Watch4TV Linux boxes (not the Access ones!). Leave one per island at least.
Turn off the wall display Linux systems (CS-CCR-A, B, C and DWALL) as presumably the wall displays are cut already.
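The "do not turn off" list lends itself to a simple lookup. The sketch below is a hypothetical aid, not an existing tool: it assumes that the short names on this slide (for example C1WF) expand with the CWO-CCC- prefix, and the helper name `power_off_candidates` is invented for illustration.

```python
# Consoles that must stay on. Short names from the slide are expanded with
# the "CWO-CCC-" prefix, which is an assumption about the naming convention.
KEEP_ON = {
    # TI consoles and CSAM
    "CWO-CCC-C0WF", "CWO-CCC-C1WF", "CWO-CCC-C2WC", "CWO-CCC-C2WF",
    "CWO-CCC-C0WA", "CWO-CCC-C1WA", "STCSAM-TCR2",
    # Cryo consoles
    "CWO-CCC-C4WC", "CWO-CCC-C8WC",
    # PS Access and Radiation consoles
    "CWO-CCC-B9WC", "CWO-CCC-B9WF", "CWO-CCC-B9LC",
    # SPS and North Areas consoles
    "CWO-CCC-A8WC", "CWO-CCC-A9WC", "CWO-CCC-A8LF", "CWO-CCC-A9LF",
}


def power_off_candidates(hostnames):
    """Return, sorted, the hostnames that are not on the keep-on list."""
    return sorted(h for h in hostnames if h.upper() not in KEEP_ON)


if __name__ == "__main__":
    # "CWO-CCC-D1WC" is a made-up hostname used only for the example.
    print(power_off_candidates(["CWO-CCC-C4WC", "CWO-CCC-D1WC", "CWO-CCC-B9LC"]))
```

The lookup does not cover the rules that need judgement on the spot: keeping one Linux and one Windows system per island, the LHC Access system if installed, and the Access Watch4TV boxes.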
Power Off in the CCR
If you are tempted to economize power in the CCR beware of:
– Briefly pressing the power button on a Linux machine performs a dirty shut down of the machine. This may corrupt:
• SonicMQ databases (LASER1/2, SLJAS*, TCRPL*, ABCOPL8 for Oasis)
• PVSS databases (CS-CCR-Q*, QPS*, WIC01, PIC01)
• Linux/VMware/WindowsXP/Wizcon systems (CS-CCR-CV*)
– Many other machines could be shut down in the CCR but the list has not been drawn up. Try CS-CCR-SPARE*!
To cleanly shut down Linux systems you need to have the root password or be in the list of “sudoers”. Many of you are already in this list which includes all the CO Exploitation team. The command to execute is:
shutdown -h now
Many of the AB/CO/IN team can do this remotely from home. They can also power them back on using the Integrated Lights Out (ILO) web pages.
HP Proliant rack
HP Proliant 380DL G4
Power on in the CCR
The basic order of restart after a complete power loss is:
– SAMOA (HPUX Yellow Pages Server)
– HPDEPOT (HPUX 11 infrastructure)
– ABSRV1 (composed of either ABSRV2 or ABSRV3)
– TCRSRV1 (composed of either TCRSRV2 or TCRSRV3), STSRV1, ELSRV1
– CS-CCR-FEOP, CS-CCR-FELAB, CS-CCR-NFS*
– CS-CCR-INF* (for XCLUC, Big Brother, LEMON, etc.), ABSPS1
– CS-CCR-WWW1, HPSLWEB (Web Servers), SEATTLE, LSASRV1
– CS-CCR-CMW1, SLJAS1, SLJAS2, SLJAS3 (Middleware)
– LASER1 and LASER2 (Alarms)
– TCRPL* (TIM)
– The rest of the HP Proliants, legacy PC desktop Linux boxes and HPUX systems (may need fsck -y) followed by the Consoles in the CCC
– Make sure the Front Ends for Remote Reboot called RMSPCR and RMSTCR are running.
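The restart order above can be written down as data so that any list of machines can be sorted into it. This Python sketch is illustrative only: the names `RESTART_STAGES`, `restart_stage` and `restart_order` are hypothetical, the wildcards are matched with the standard `fnmatch` module, and anything not named on the slide is placed in the final "rest" stage.

```python
from fnmatch import fnmatchcase

# One entry per line of the slide, in restart order. "*" is a wildcard.
RESTART_STAGES = [
    ["SAMOA"],
    ["HPDEPOT"],
    ["ABSRV1"],
    ["TCRSRV1", "STSRV1", "ELSRV1"],
    ["CS-CCR-FEOP", "CS-CCR-FELAB", "CS-CCR-NFS*"],
    ["CS-CCR-INF*", "ABSPS1"],
    ["CS-CCR-WWW1", "HPSLWEB", "SEATTLE", "LSASRV1"],
    ["CS-CCR-CMW1", "SLJAS1", "SLJAS2", "SLJAS3"],
    ["LASER1", "LASER2"],
    ["TCRPL*"],
]


def restart_stage(host):
    """Return the stage index of a host; unknown hosts fall into the last stage."""
    name = host.upper()
    for index, patterns in enumerate(RESTART_STAGES):
        if any(fnmatchcase(name, pattern) for pattern in patterns):
            return index
    return len(RESTART_STAGES)


def restart_order(hosts):
    """Sort hosts by stage; hosts in the same stage keep their input order."""
    return sorted(hosts, key=restart_stage)


if __name__ == "__main__":
    # TCRPL3 and CS-CCR-NFS2 are made-up names that match the wildcards.
    print(restart_order(["LASER2", "TCRPL3", "SAMOA", "CS-CCR-NFS2", "ABSRV1"]))
```

Keeping the stages as a plain list makes the order easy to review against the slide and to update when machines are added or retired.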
Recommendations for the future
Train the TI Operators to:
– Call us quickly (I was called by an SPS operator at 12:15 29/7/2006)
– Check the switches and air conditioning in the CCR and Starpoint
I have done this for quite a few TI and SPS operators already
Train other members of CO what to do and provide documentation
I hope this Technical Committee presentation has achieved this aim
Fix the main weaknesses in the infrastructure
– Errors in HP Proliant cabling - done
– HP Proliant Firmware update – partially done
– Cable HP Procurve switches on EOD2 to avoid trips - started
– Move the XCLUC/Clogger system to a dual powered system – started
– Move our online backup machines CS-CCR-BACKUP* elsewhere (513?) – dialog with IT started
– Consider making all our NFS fileservers dual powered – there are pros and cons, principally software reliability versus hardware reliability
Annex: Simplified Timeline of 29/7/2006 power cut (1/2)
7h47 : Explosion of the Swiss/French Auto Transfer system
7h49 : Start of the Diesel(s) in Building 212 (Jura)
7h56 : loss of EOD9
– loss of General Network CESAM switch
– loss of switch connecting half the CCC IP phones (including 72200)
8h35 : loss of diesel
8h47 : loss of EOD1, dropped after 10 minutes of missing input power
– Crash and restart of LASER1, LASER2 and CS-CCR-Q4DS3 (cryo), HP claims that this can be fixed by upgrading the firmware. TI Operators probably lost Alarm System at this moment.
– loss of 6 HP Proliants with both inputs connected to EOD1 only (now fixed)
9h23 : Loss of UPS in Building 866 leads to loss of Telephone Node 5 and the backup in Building 58 is not accessible. Total loss of IP + “analog” phones + Sunrise GSM antenna. TI could receive calls via Orange France but not make them.
Annex: Simplified Timeline of 29/7/2006 power cut (2/2)
9h44 : Re-powering of EBD5 and EBD4 but EBD1 in CCR trips and EBD1, EBD2, EBD3 and EBD6 are not working. Overload in the line before EOD9 means EOD9 is also not re-powered. Building 866 telephony re-powered.
9h45 : Re-powering of EOD2 and EOD3 from normal power. However, the source of EOD1 does not have automatic re-enable, so EOD1 stays down.
9h46 : Loss of the Technical Network switches because the “disjoncteur” has tripped
9h50 : Telephone Node 5 restarts, “analog” phones work.
13h45: Source of EOD1 is manually re-enabled, so EOD1 now works. Slightly before this, the “Star Point deporté” (serving in particular LASER2 and CS-CCR-FEOP) was fixed, as were the Technet Switches in the main Starpoint.
adapted from “Rapport coupure CCC du 29-07-2006.pdf” by Jean Juillard (AB/OP/TI)
see also “Report on the Power Cut of 29/7/2006” by Alastair Bland at the end of https://edms.cern.ch/file/766492/1/MAM27.doc
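The intervals between the events in the timeline are easy to work out by hand, but a short script avoids slips when comparing them. The sketch below is a hypothetical helper (the names `minutes` and `elapsed` are invented) that parses times written as in the annex, for example 7h47, and assumes both times fall on the same day.

```python
def minutes(hhmm):
    """Convert a time written like '7h47' into minutes since midnight."""
    hours, mins = hhmm.strip().split("h")
    return int(hours) * 60 + int(mins)


def elapsed(start, end):
    """Minutes between two same-day times written like '7h47'."""
    return minutes(end) - minutes(start)


if __name__ == "__main__":
    print("Explosion to loss of EOD9:", elapsed("7h47", "7h56"), "min")
    print("Diesel running:", elapsed("7h49", "8h35"), "min")
    print("Loss of diesel to loss of EOD1:", elapsed("8h35", "8h47"), "min")
    print("EOD1 down:", elapsed("8h47", "13h45"), "min")
```

With the times given in the annex, EOD9 was lost 9 minutes after the explosion, the diesel ran for 46 minutes, EOD1 dropped 12 minutes after the diesel was lost, and EOD1 stayed down for 298 minutes (4 hours 58 minutes).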