RAL Tier A Status
Tim Adye, Rutherford Appleton Laboratory
BaBar UK Collaboration Meeting, Liverpool, 11th April 2003




BaBar Batch CPU Use at RAL

[Chart: BaBar CPU hours per week (normalised to P450) by week beginning; vertical axis 0 to 120,000; series: SP, UK Users, Non-UK Users]

Full usage at full efficiency of BaBar CPUs = 106,624 Hours/Week; 59,733 according to MOU
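As a rough cross-check of the two figures above, the sketch below converts them into continuously busy P450-equivalent CPUs and works out the MOU share of full capacity. It assumes that "full usage" means every CPU is busy 24 hours a day, 7 days a week (168 hours); that reading, and all function and variable names, are assumptions rather than anything stated on the slide.

```python
# Figures quoted on the slide (CPU hours per week, normalised to P450).
FULL_CAPACITY_HOURS = 106_624
MOU_HOURS = 59_733

# Assumption: "full usage" means 24 h x 7 days of running per CPU.
HOURS_PER_WEEK = 24 * 7


def p450_equivalents(cpu_hours_per_week: float) -> float:
    """Number of P450-equivalent CPUs that would be busy all week."""
    return cpu_hours_per_week / HOURS_PER_WEEK


def mou_fraction() -> float:
    """MOU commitment as a fraction of full capacity."""
    return MOU_HOURS / FULL_CAPACITY_HOURS


if __name__ == "__main__":
    print(f"Full capacity: {p450_equivalents(FULL_CAPACITY_HOURS):.1f} P450-equivalent CPUs")
    print(f"MOU level:     {p450_equivalents(MOU_HOURS):.1f} P450-equivalent CPUs")
    print(f"MOU share of full capacity: {mou_fraction():.1%}")
```

Under that assumption the MOU level is a little over half of the installed capacity.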


BaBar Batch Users at RAL (running at least one non-trivial job each week)

[Chart: BaBar users per week by week beginning; vertical axis 0 to 45; series: UK Users, Non-UK Users]

A total of 196 new BaBar users registered since December 2001


Kanga Disk Saga

• In December we had filled up all ~20 TB at RAL
• Freed up some space by deleting (most) old Series-8 data and started importing the backlog
• A minor upgrade of our old data server, csfsun02, on 19 Feb prompted a major loss of data
• Recovered
  • 1.3 TB scavenged from csfsun02 disks
  • 1.4 TB re-imported from SLAC disk
  • 0.3 TB restored from SLAC HPSS
• Half way through recovering, discovered that csfsun02 was still bad
  • All data migrated to borrowed servers
• All Kanga data restored and up-to-date with SLAC production on 28 March


Security Incident

• SucKIT Linux root exploit has been spreading throughout the HEP community
• An infected machine records all passwords typed on that machine
  • Includes passwords used to connect to other machines
  • ssh included; fortunately not klog
• It’s not unlikely that CSF passwords have been compromised by another system
• To protect CSF from further attack, all passwords that have been used recently were reset Tuesday
  • Users contacted by phone and post
  • I can give you your new password today


Linux Upgrade

• Nearly all machines at RAL now run RedHat 7.2
• Exceptions are
  • babar-old.gridpp.rl.ac.uk front-end (AKA csfc)
    • Will be switched off next week
  • babarbuild batch queue
• RH72 batch workers can run RH6 jobs, but RH72 machines can’t build code in release analysis-13 and before, so
  • Upgrade to analysis-13b or later
  • Use the babarbuild queue to compile and link; run in the normal queues


CSF Batch System

• Much work behind the scenes
  • Reliability and optimising queuing algorithms

• Use bbrbsub to submit, e.g.
  bbrbsub -l cput=01:00:00 BetaApp myAnalysis.tcl

• bbrbsub is a wrapper for qsub, so you can use qsub options (see “man qsub”)
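For anyone generating many submissions from a script, a minimal sketch of building the bbrbsub command line is given below. It only assembles the argument list and does not run anything. The helper names (format_cput, bbrbsub_command) are hypothetical; the only options used are -l cput=HH:MM:SS, as in the example above, and -q, the standard qsub option for choosing a queue.

```python
import shlex


def format_cput(seconds: int) -> str:
    """Format a CPU-time limit as HH:MM:SS, the form used with -l cput=."""
    if seconds < 0:
        raise ValueError("CPU time limit must not be negative")
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def bbrbsub_command(job, cput_seconds=None, queue=None):
    """Return the argv list for a bbrbsub submission (hypothetical helper).

    job          -- program and its arguments, e.g. ["BetaApp", "myAnalysis.tcl"]
    cput_seconds -- optional CPU-time limit, passed through as -l cput=HH:MM:SS
    queue        -- optional queue name, passed through as qsub's -q option
    """
    argv = ["bbrbsub"]
    if queue is not None:
        argv += ["-q", queue]
    if cput_seconds is not None:
        argv += ["-l", f"cput={format_cput(cput_seconds)}"]
    argv += list(job)
    return argv


if __name__ == "__main__":
    cmd = bbrbsub_command(["BetaApp", "myAnalysis.tcl"], cput_seconds=3600)
    # Print the command rather than running it; pass cmd to subprocess.run
    # on a machine where bbrbsub is actually installed.
    print(shlex.join(cmd))
```

Keeping the command as a list avoids shell-quoting problems if a job argument ever contains spaces.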


Recently Planned Improvements – 1
Since November

• Install dedicated import-export machines
  • Fast (Gigabit) network connection
  • Special firewall rules to allow scp, bbftp, bbcp, etc.
  Two new RH72 Linux machines
  csfmove01.rl.ac.uk for exports

• AFS authentication improvements
  • PBS token passing and renewal
  • integrated login (AFS token on login, like SLAC)
  Not yet implemented


Recently Planned Improvements – 2
Since November

• Objectivity support
  • Works now for private federations, but no data import
  • First step will be to provide Objy conditions database access
  Objy conditions snapshot installed by Tim Barrass
  …Then we lost our Objy server, csfsun02

• Upgrade Suns to Solaris 8 and integrate into PBS
  4 x 4-CPU Solaris 8 systems now available in babarsol queue, e.g.
  • bbrbsub -q babarsol job.sh


Recently Planned Improvements – 3
Since November

• Support Grid “generic accounts”, so special RAL user registration is no longer necessary
  Users without an entry in the grid-mapfile will be assigned to babar001, babar002, … babar050
  The pool account will forever more be bound to that certificate DN, so you will always run under the same babar0NN
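The generic-account behaviour described on this slide can be illustrated with a small in-memory sketch. It is not the Grid software actually used at RAL. It only models the rule that a certificate DN without a grid-mapfile entry gets the next free pool account and then keeps it. The class name, the dictionary standing in for the grid-mapfile, and the example DNs are all hypothetical.

```python
class PoolAccounts:
    """Toy model of pool-account assignment (illustrative only)."""

    def __init__(self, static_map=None, prefix="babar", size=50):
        # DNs that already have an explicit grid-mapfile entry (assumed dict).
        self._static = dict(static_map or {})
        # babar001 ... babar050, handed out in order.
        self._free = [f"{prefix}{n:03d}" for n in range(1, size + 1)]
        # Binding from DN to pool account; never changed once set.
        self._bound = {}

    def account_for(self, dn: str) -> str:
        """Return the local account a certificate DN runs under."""
        if dn in self._static:
            return self._static[dn]
        if dn not in self._bound:
            if not self._free:
                raise RuntimeError("no free pool accounts left")
            self._bound[dn] = self._free.pop(0)
        return self._bound[dn]


if __name__ == "__main__":
    pool = PoolAccounts(static_map={"/CN=registered user": "someuser"})
    print(pool.account_for("/CN=new user A"))      # first free pool account
    print(pool.account_for("/CN=new user B"))      # next free pool account
    print(pool.account_for("/CN=new user A"))      # same account as before
    print(pool.account_for("/CN=registered user")) # explicit entry wins
```

A real deployment would have to store the bindings persistently so that they survive restarts; the dictionary here is only for illustration.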


Support

• For help, post to “RAL Tier A” HyperNews forum; or

• contact Emmanuel Olaiya (at SLAC) or me (at RAL)