
Page 1

Northgrid Status

Alessandra Forti, GridPP24, RHUL, 15 April 2010

Page 2

Outline

• APEL pies
• Lancaster status
• Liverpool status
• Manchester status
• Sheffield status
• Conclusions

Page 3

APEL pie (1)

Page 4

APEL pie (2)

Page 5

APEL pie (3)

Page 6

Lancaster

• All WNs moved to the tarball install
• Moving all nodes to SL5 solved the “sub-cluster” problems
• Deployed and decommissioned a test SCAS
– Will install glexec when users demand it
• In the middle of deploying a CREAM CE
• Finished tendering for the HEC facility
– Will give us access to 2500 cores
– Extra 280 TB of storage
– The shared facility has Roger Jones as director, so we have a strong voice for GridPP interests

Page 7

Lancaster

• Older storage nodes are being re-tasked
• Tarball WNs are working well, but YAIM is suboptimal for configuring them
• Maui continues to be weird for us
– Jobs blocking other jobs
– Confused by multiple queues
– Jobs don't use their reservations when they are blocked (see the toy sketch below)
• Problems trying to use the same NFS server for experiment software and tarballs
– They have now been split
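As a reminder of what correct behaviour looks like, here is a toy Python sketch of priority reservation plus backfill; all job names and core counts are invented for illustration, and this is not Maui code.

```python
# Toy model of reservation + backfill: the blocked top-priority job
# should hold a reservation while smaller jobs fill the gap around it.
# Job names and sizes are invented.

free_cores = 4  # cores idle right now

# queue in priority order: (name, cores_needed)
queue = [("big", 8), ("small1", 2), ("wide", 6), ("small2", 2)]

top_name, top_cores = queue[0]
if top_cores > free_cores:
    # the blocked top job gets a reservation on future cores...
    print(f"reserving {top_cores} cores for {top_name}")
    # ...and only jobs that fit in the currently free cores backfill
    for name, cores in queue[1:]:
        if cores <= free_cores:
            free_cores -= cores
            print(f"backfilling {name} ({cores} cores), {free_cores} free")
        else:
            print(f"{name} waits ({cores} cores > {free_cores} free)")
```

A real scheduler also checks each backfill candidate's walltime against the reservation start time; the point here is only that blocked jobs should keep their reservations, which is where Maui was misbehaving.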

Page 8

Liverpool

• What we did (and were supposed to do)
– Major hardware procurement
• 48 TB unit with a 4 Gbit bonded link
• 7×4×8 units = 224 cores, 3 GB memory, 2×1 TB disks
– Scrapped some 32-bit nodes
– CREAM test CE running
• Other things we did
– General guide to capacity publishing (arithmetic sketched below)
– Horizontal job allocation
– Improved use of VMs
– Grid use of slack local HEP nodes
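Capacity publishing boils down to a little arithmetic, so here is a minimal Python sketch of the per-core benchmark conversion such a guide covers; the HEP-SPEC06 per core value is an assumed figure, not Liverpool's measurement, while the core count comes from the procurement above.

```python
# Illustrative Glue capacity-publishing arithmetic.
# HS06_PER_CORE is an assumption, not a measured Liverpool figure.

HS06_PER_CORE = 8.0    # assumed HEP-SPEC06 per core
NODES = 28             # 7x4 units from the procurement above
CORES_PER_NODE = 8

# WLCG-agreed conversion: 1 HEP-SPEC06 = 250 SpecInt2000
glue_host_benchmark_si00 = round(HS06_PER_CORE * 250)
glue_subcluster_logical_cpus = NODES * CORES_PER_NODE  # 224 cores

print(f"GlueHostBenchmarkSI00 = {glue_host_benchmark_si00}")
print(f"GlueSubClusterLogicalCPUs = {glue_subcluster_logical_cpus}")
```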

Page 9

Liverpool

• Things in progress
– Put CREAM in GOCDB (ready)
– Scrap all 32-bit nodes (gradually)
– Production runs of the central computing cluster (other dept involved)
• Problems
– Obsolete equipment
– WMS/ICE fault at RAL
• What's next
– Install/deploy the newly procured storage and CPU hardware
– Achieve production runs of the central computing cluster

Page 10

Manchester

• Since last time:
– Upgraded WNs to SL5
– Eliminated all dCache setup from the nodes
– RAID0 on internal disks
– Increased scratch area
– Unified the two DPM instances
– 106 TB, of which 84 are dedicated to ATLAS
– Upgraded DPM to 1.7.2
– Changed the network configuration of the data servers
– Installed a Squid cache
– Installed a CREAM CE (still in test phase)
– Last HC test in March: 99% efficiency

Page 11

Manchester

• Major UK site in ATLAS production, second or third after RAL and Glasgow
• Last HC test in March had 99% efficiency
• 80 TB almost empty
– Not many jobs
– But from the stats of the past few days, real users also seem fine (96%)

Page 12

Manchester

• Tender
– European tender submitted 15/09/2009
– Vendor replies should be in by 16/04/2010 (in two days)
– Additional GridPP3 money can be added
• Included a clause for an increased budget
– Minimum requirements: 4400 HEPSPEC / 240 TB (sizing sketched below)
• Can be exceeded
• Buying only nodes
– Talking to the University about Green funding to replace what we can't replace
• Not easy
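To put the compute floor in context, a rough Python sizing sketch; the per-core HEP-SPEC06 figure is an assumption for illustration, not a vendor quote.

```python
# Rough sizing of the tender minimum quoted above.
# hs06_per_core is an assumed 2010-era figure, not a vendor number.

min_hs06 = 4400         # minimum compute requirement (HEP-SPEC06)
min_tb = 240            # minimum storage requirement (TB)
hs06_per_core = 8.0     # assumed benchmark per core

cores_needed = min_hs06 / hs06_per_core
print(f"~{cores_needed:.0f} cores to reach {min_hs06} HEP-SPEC06, plus {min_tb} TB")
```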

Page 13

Sheffield

• Storage upgrade
– Storage moved to physics: 24/7 access
– All nodes running SL5, DPM 1.7.3
– 4×25 TB disk pools, 2 TB disks, RAID5, 4 cores
– Memory will be upgraded to 8 GB on all nodes
– 95% reserved for ATLAS
– XFS crashed; problem solved with an additional kernel module
– Software server: 1 TB (RAID1)
– Squid server

Page 14

Sheffield

• Worker nodes
– 200 old 2.4 GHz, 2 GB, SL5
– 72 TB of local disk per 2 cores
– lcg-CE and MON box on SL4
– An additional 32 amp ring has been added
– Fibre link between CICS and physics
• Availability
– 97-98% since January 2008
– 94.5% efficiency in ATLAS

Page 15

Sheffield Plans

• Additional storage
– 20 TB more → brings the total to 120 TB for ATLAS (quick check below)
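A one-line sanity check of that total, using the 4×25 TB pools from the storage slide:

```python
# Sanity check of the planned ATLAS total quoted above.
current_tb = 4 * 25           # existing disk pools (storage slide)
planned_tb = current_tb + 20  # plus the additional storage
print(planned_tb)             # -> 120 TB
```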

• Cluster integration
– Local HEP and UKI-NORTHGRID-SHEF-HEP will have joint WNs
– 128 CPUs + 72 new nodes ???
– Torque server from the local cluster and lcg-CE from the grid cluster
– Need 2 days of downtime; waiting for ATLAS approval
– CREAM CE installed; waiting to complete cluster integration