11th April 2003 Tim Adye 1
RAL Tier A Status
Tim Adye
Rutherford Appleton Laboratory
BaBar UK Collaboration Meeting
Liverpool
11th April 2003
BaBar Batch CPU Use at RAL
[Figure: BaBar CPU hours per week (normalised to P450) against week beginning; vertical axis 0 to 120,000; series: SP, UK Users, Non-UK Users]
Full usage at full efficiency of BaBar CPUs = 106,624 Hours/Week; 59,733 according to MOU
BaBar Batch Users at RAL (running at least one non-trivial job each week)
[Figure: BaBar users per week against week beginning; vertical axis 0 to 45; series: UK Users, Non-UK Users]
A total of 196 new BaBar users registered since December 2001
Kanga Disk Saga
• In December we had filled up all ~20 TB at RAL
• Freed up some space by deleting (most) old Series-8 data and started importing the backlog
• A minor upgrade of our old data server, csfsun02, on 19 Feb prompted a major loss of data
• Recovered
  • 1.3 TB scavenged from csfsun02 disks
  • 1.4 TB re-imported from SLAC disk
  • 0.3 TB restored from SLAC HPSS
• Half way through recovering, discovered that csfsun02 was still bad.
  • All data migrated to borrowed servers.
• All Kanga data restored and up-to-date with SLAC production on 28 March.
Security Incident
• SucKIT Linux root exploit has been spreading throughout the HEP community
• An infected machine records all passwords typed on that machine
  • Includes passwords used to connect to other machines
  • ssh included; fortunately not klog
• It’s not unlikely that CSF passwords have been compromised via another system
• To protect CSF from further attack, all passwords that have been used recently were reset on Tuesday
  • Users contacted by phone and post
  • I can give you your new password today
Linux Upgrade
• Nearly all machines at RAL now run RedHat 7.2
• Exceptions are
  • babar-old.gridpp.rl.ac.uk front-end (AKA csfc)
    • Will be switched off next week
  • babarbuild batch queue
• RH72 batch workers can run RH6 jobs, but RH72 machines can’t build code in release analysis-13 and before, so
  • Upgrade to analysis-13b or later
  • Use the babarbuild queue to compile and link; run in the normal queues
CSF Batch System
• Much work behind the scenes
  • Reliability and optimising queuing algorithms
• Use bbrbsub to submit, e.g.
  bbrbsub -l cput=01:00:00 BetaApp myAnalysis.tcl
• bbrbsub is a wrapper for qsub, so you can use qsub options (see “man qsub”)
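
As a rough illustration of the submission line above, the following Python sketch assembles a bbrbsub argument list without running it. The helper names `format_cput` and `bbrbsub_command` are hypothetical and not part of any RAL tool; the only options used are `-l cput=HH:MM:SS` and `-q <queue>`, both of which appear in these slides.

```python
import shlex
from datetime import timedelta


def format_cput(limit: timedelta) -> str:
    """Render a CPU-time limit as HH:MM:SS, the form used in the slide's example."""
    total = int(limit.total_seconds())
    if total <= 0:
        raise ValueError("CPU-time limit must be positive")
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def bbrbsub_command(job_args, cput=None, queue=None):
    """Build (but do not execute) a bbrbsub argument list.

    Hypothetical helper: it only concatenates the options shown in the
    slides in front of the job's own command line.
    """
    cmd = ["bbrbsub"]
    if queue:
        cmd += ["-q", queue]
    if cput is not None:
        cmd += ["-l", f"cput={format_cput(cput)}"]
    cmd += list(job_args)
    return cmd


# Reproduce the example from the slide: a one-hour CPU limit for BetaApp.
example = bbrbsub_command(["BetaApp", "myAnalysis.tcl"], cput=timedelta(hours=1))
print(shlex.join(example))
```

Passing `queue="babarbuild"` or `queue="babarsol"` would select one of the special queues mentioned elsewhere in this talk; leaving it unset uses the normal queues.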
Recently Planned Improvements – 1
Since November
• Install dedicated import-export machines
  • Fast (Gigabit) network connection
  • Special firewall rules to allow scp, bbftp, bbcp, etc.
  Two new RH72 Linux machines
  csfmove01.rl.ac.uk for exports
• AFS authentication improvements
  • PBS token passing and renewal
  • integrated login (AFS token on login, like SLAC)
  Not yet implemented
Recently Planned Improvements – 2
Since November
• Objectivity support
  • Works now for private federations, but no data import
  • First step will be to provide Objy conditions database access
  Objy conditions snapshot installed by Tim Barrass
  …Then we lost our Objy server, csfsun02
• Upgrade Suns to Solaris 8 and integrate into PBS
  4 x 4-CPU Solaris 8 systems now available in babarsol queue, e.g.
  • bbrbsub -q babarsol job.sh
Recently Planned Improvements – 3
Since November
• Support Grid “generic accounts”, so special RAL user registration is no longer necessary
  Users without an entry in the grid-mapfile will be assigned to babar001, babar002, … babar050
  The pool account will forever more be bound to that certificate DN, so you will always run under the same babar0NN
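
The pool-account behaviour described on this slide can be pictured with a small in-memory model. This is only a sketch of the stated policy (explicit grid-mapfile entries win, otherwise the next free babar0NN is bound permanently to the certificate DN); the class name, method name and example DNs are hypothetical, and a real service would need to store the bindings persistently.

```python
class PoolAccountMapper:
    """Toy model of pool-account assignment; not the actual RAL implementation."""

    def __init__(self, prefix="babar", size=50):
        # babar001 ... babar050, handed out in order.
        self._free = [f"{prefix}{n:03d}" for n in range(1, size + 1)]
        self._bound = {}  # certificate DN -> pool account, never released

    def account_for(self, dn, gridmap=None):
        """Return the local account for a certificate DN.

        `gridmap` stands in for explicit grid-mapfile entries (DN -> account).
        """
        gridmap = gridmap or {}
        if dn in gridmap:
            return gridmap[dn]       # registered user: keep their own account
        if dn in self._bound:
            return self._bound[dn]   # same DN always gets the same babar0NN
        if not self._free:
            raise RuntimeError("no pool accounts left")
        account = self._free.pop(0)
        self._bound[dn] = account
        return account


mapper = PoolAccountMapper()
first = mapper.account_for("/O=Example/CN=Alice")   # hypothetical DN
second = mapper.account_for("/O=Example/CN=Bob")    # hypothetical DN
again = mapper.account_for("/O=Example/CN=Alice")
print(first, second, again)
```

The binding is deliberately one-way: once a DN has a pool account, the sketch never recycles it, mirroring the "forever more" wording on the slide.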
Support
• For help, post to “RAL Tier A” HyperNews forum; or
• contact Emmanuel Olaiya (at SLAC) or me (at RAL)