9
Oxford STEP09 Report Ewan MacMahon/ Pete Gronbech HEPSYSMAN RAL 2nd July 2009

Oxford STEP09 Report Ewan MacMahon/ Pete Gronbech HEPSYSMAN RAL 2nd July 2009

Embed Size (px)

Citation preview

Page 1: Oxford STEP09 Report Ewan MacMahon/ Pete Gronbech HEPSYSMAN RAL 2nd July 2009

Oxford STEP09 Report

Ewan MacMahon/ Pete Gronbech

HEPSYSMAN RAL2nd July 2009

Page 2: Oxford STEP09 Report Ewan MacMahon/ Pete Gronbech HEPSYSMAN RAL 2nd July 2009

Oxford STEP09 Report2

Storage access

•Our main lesson from STEP09 was the big change in ATLAS’ use of storage from lots of small files to fewer large ones.•This moved the SE bottleneck from authentication on head node to bandwidth on the disk pools.•We implemented network channel bonding on the pools to immediate effect.

•We had five disk pools taking most of the load; with more servers we should have higher overall bandwidth.

Page 3: Oxford STEP09 Report Ewan MacMahon/ Pete Gronbech HEPSYSMAN RAL 2nd July 2009

Oxford STEP09 Report3

Some pool nodes not stressed

• Some smaller pool nodes were marked read only and hence had no STEP09 data

Page 4: Oxford STEP09 Report Ewan MacMahon/ Pete Gronbech HEPSYSMAN RAL 2nd July 2009

Oxford STEP09 Report4

• Earlier tests stressed the head nodes network more.

• STEP09 had little load on head node, and only sporadic network activity

Page 5: Oxford STEP09 Report Ewan MacMahon/ Pete Gronbech HEPSYSMAN RAL 2nd July 2009

Oxford STEP09 Report5

Data Transfer – small but successful

• Oxford are on an 18% share of AODs.• MCDISK had over 10TB of old recon

Total: 337240 files in 548 datasets, size 10902.36GBc.f. Glasgow where we have 8 recon datasets.

Page 6: Oxford STEP09 Report Ewan MacMahon/ Pete Gronbech HEPSYSMAN RAL 2nd July 2009

Oxford STEP09 Report6

Observations - ATLAS

• Apparently ATLAS were keeping an eLog, we weren’t aware of it (or forgot about it) until the test was essentially over.

• ATLAS pilot jobs were coming in with Role=Production and Role=Pilot. It makes no sense and it breaks things; please stop.

• ATLAS’ pilot factory got wedged by a dead CE. We have a dual redundant pair of CEs feeding the main SL4 system, t2ce02 and t2ce04, one of which died. Initially the factory was only using t2ce02, then, when reconfigured, refused to use t2ce04 because of the backlog of (dead) pilots that had been sent to t2ce02.

• The direct (non rfcp file stager) access kills us; we can’t run many jobs like that at once.

Page 7: Oxford STEP09 Report Ewan MacMahon/ Pete Gronbech HEPSYSMAN RAL 2nd July 2009

Oxford STEP09 Report7

Observations - LHCb

• We were sent very few jobs, as far as we know they ran fine.

• Er. That’s it.

Page 8: Oxford STEP09 Report Ewan MacMahon/ Pete Gronbech HEPSYSMAN RAL 2nd July 2009

Oxford STEP09 Report8

What next?

• It would be useful to run more tests:– on the individual access methods, one at a time,– Then a repeat of the mixture, but with different DNs for each

method.• Likely upgrades:

– More pools online. We’ve been waiting for DPM 1.7 to do some necessary maintenance. Once that’s done we should have about twelve active pool nodes rather than five.

– 10Gb Ethernet? (Not soon, and probably only on new kit)

Page 9: Oxford STEP09 Report Ewan MacMahon/ Pete Gronbech HEPSYSMAN RAL 2nd July 2009

Oxford STEP09 Report9

Conclusions

• Oxford has no local ATLAS expert so we suffered from a lack of awareness of the three different submission methods.

• Have been working with Peter Love to understand why the pilot job submission method was failing.

• We would like to have a post mortem meeting with Brian to make sure we understand what caused our other various problems.