Status of the production and news about Nagios ALICE TF Meeting 22/07/2010

Summary of the last week production

• Large amount of running jobs (expected)
  – Pass 1 reconstruction (LHC10d) activities ongoing
  – Analysis trains and user analysis tasks ongoing
  – 2 MC cycles
    • LHC10d14: pp events, Pythia6, 900 GeV
    • LHC10d15: pp events, Phojet, 900 GeV
    • 7.4 M requested events per cycle
    • Currently 90% completed
  – New MC cycles expected during the weekend

Job profile per site

Activity is currently decreasing, mostly at the T0: not an issue, well understood

Decrease of running jobs at key T1 sites: FZK and CNAF

Job profiles per user

MC and reconstruction activities

User analysis activities

Raw data information

52TB of raw data transferred to T0

T0-T1 raw data transfers: up to 270MB/s achieved
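To put these two figures side by side, here is a back-of-the-envelope sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a rate sustained at the quoted peak, which real transfers do not necessarily hold. The slide quotes the volume for transfers to the T0 and the rate for T0-T1 transfers, so combining them is purely illustrative. The function name `transfer_hours` is hypothetical.

```python
def transfer_hours(volume_tb: float, rate_mb_s: float) -> float:
    """Hours needed to move volume_tb terabytes at rate_mb_s megabytes per second.

    Assumes decimal units: 1 TB = 1e12 bytes, 1 MB = 1e6 bytes.
    """
    seconds = (volume_tb * 1e12) / (rate_mb_s * 1e6)
    return seconds / 3600.0


# Illustrative only: 52 TB moved at a constant 270 MB/s.
hours = transfer_hours(52, 270)
print(f"{hours:.1f} h")  # prints "53.5 h"
```

At the peak rate, a volume of this size corresponds to a little over two days of continuous transfer.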

Site news (T0 site)

• CERN
  – Cooling problem at IT this week affecting the experiment VOBOXES declared with an importance below 50
    • In the case of ALICE: none of the production VOBOXES nor the xrootd redirectors were affected (importance = 90)
    • CAF nodes affected (nodes were switched off). CAF users prevented. The importance of all CAF nodes (nodes and PROOF master) has already been increased to 50
  – CREAM-CE: better performance and stability of the systems this week
  – Transparent upgrade of CASTOR2 (including xrootd) on the 21st of July
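The importance rule described for the cooling incident can be sketched as a simple threshold filter. This is only an illustration: the threshold of 50 and the importance of 90 for production machines come from the slide, while the node names, the `Node` class, and the pre-incident CAF importance of 40 are hypothetical.

```python
from dataclasses import dataclass

# From the slide: machines declared with an importance below 50 were affected.
SHUTDOWN_THRESHOLD = 50


@dataclass
class Node:
    name: str        # hypothetical host label
    importance: int  # importance value declared for the machine


def affected_by_shutdown(nodes, threshold=SHUTDOWN_THRESHOLD):
    """Return the names of nodes whose importance is strictly below the threshold."""
    return [n.name for n in nodes if n.importance < threshold]


nodes = [
    Node("production-vobox", 90),   # importance 90, as quoted on the slide
    Node("xrootd-redirector", 90),  # importance 90, as quoted on the slide
    Node("caf-node-before", 40),    # assumed value; the slide only says "below 50"
    Node("caf-node-after", 50),     # raised to 50 after the incident
]
print(affected_by_shutdown(nodes))  # prints ['caf-node-before']
```

With a strict "below 50" comparison, raising the CAF nodes to exactly 50 is enough to keep them out of the affected set.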

Site news (T1 sites)

• CNAF
  – Job profile: problems with the information reported by the resource BDII twice this week
    • Up to 11K ALICE agents waiting in the queues
  – SE disk space for ALICE increased. 546 TB available (32% used as of the 21st of July)
  – Today the SE was reporting some problems in ML: SOLVED by Francesco
    • All server services have been restarted
• SARA
  – SE in scheduled downtime this week. Upgrade of dCache
  – System back in production and good performance in ML
• LYON
  – SE still under tuning
  – The configuration is set up and writing on the storage disks works; however, there's a problem with the migration of the disk data towards HPSS (rfio issue)
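For the CNAF storage figures above, a quick worked calculation shows what 32% usage means in absolute terms. It assumes the quoted 546 refers to the total ALICE SE capacity in terabytes and that the percentage applies to that total; the slide does not state this explicitly. The function name `disk_usage` is hypothetical.

```python
def disk_usage(total_tb: float, used_fraction: float):
    """Split a total capacity into (used, free) terabytes for a given used fraction."""
    used = total_tb * used_fraction
    return used, total_tb - used


# Assumption: 546 TB is the total capacity and 32% of it is in use.
used, free = disk_usage(546, 0.32)
print(f"used ~{used:.0f} TB, free ~{free:.0f} TB")  # prints "used ~175 TB, free ~371 TB"
```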

Site news (T2 sites)

• Torino
  – Migrating CREAM system to the latest version
• Madrid
  – Same operation as in Torino
• Cyfronet
  – Lack of available resources. Waiting for the site admin to increase them

Site summary

• Hardly going over 20K concurrent jobs
  – Cooling problems at FZK last week
  – Info system issues with CNAF
  – High load from another experiment at Lyon
• Several sites have seen a high number of jobs running over 46h this week
  – Pathological jobs. Although finishing correctly, their outputs cannot be used
  – Preventive measure: set the CPU time of the ALICE queues to a 24h limit
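The effect of the proposed 24h limit can be sketched as a filter over per-job CPU times. This is a minimal illustration rather than how any batch system enforces the limit: the job identifiers, the sample times, and the function `over_limit` are hypothetical, and only the 24h value comes from the slide.

```python
from datetime import timedelta

# From the slide: CPU time limit proposed for the ALICE queues.
CPU_LIMIT = timedelta(hours=24)


def over_limit(jobs, limit=CPU_LIMIT):
    """Return the sorted ids of jobs whose CPU time exceeds the limit.

    jobs maps a job id (hypothetical) to its consumed CPU time as a timedelta.
    """
    return sorted(job_id for job_id, cpu in jobs.items() if cpu > limit)


# Hypothetical sample: one normal job, one at the limit, one pathological 46h job.
sample = {
    "job-a": timedelta(hours=6),
    "job-b": timedelta(hours=24),
    "job-c": timedelta(hours=46),
}
print(over_limit(sample))  # prints ['job-c']
```

A job sitting exactly at 24h is not flagged here; whether a real queue treats the boundary as inclusive depends on the batch system configuration.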

Nagios news

• Currently publishing 2 sensors in the VALIDATION infrastructure: VOBOXES and CE
• Some of the discrepancies found between the SAM and Nagios results: SOLVED
• SITES SHOULD KNOW:
  – The VOBOXES MUST be published in the GOCDB and the BDII
  – VOBOXES MUST be pingable from samnag014.cern.ch
    • Standard Nagios test
    • Requested by ALICE about one month ago through this meeting
  – Together we will need to define when this infrastructure should be put in production
    • This needs to be announced at the next MB meeting on the 27th
    • This implies the deprecation of SAM
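A site admin who wants a first check of the "pingable" requirement could use something like the sketch below. It is not the Nagios test itself: it only checks ICMP reachability from the machine where it runs, whereas the requirement concerns reachability from samnag014.cern.ch. It assumes a Linux `ping` where `-c` sets the packet count and `-W` the reply timeout in seconds; the function names and the example hostname are hypothetical.

```python
import subprocess


def ping_command(host, count=1, timeout_s=5):
    """Build the argument list for a Linux ping: -c packet count, -W timeout in seconds."""
    return ["ping", "-c", str(count), "-W", str(timeout_s), host]


def is_pingable(host, run=subprocess.run):
    """Return True if ping exits with status 0, meaning at least one reply arrived.

    The run parameter is injectable so the logic can be exercised without network access.
    """
    result = run(
        ping_command(host),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

For example, `is_pingable("vobox.example.org")` (a placeholder hostname) returns `True` only when the host answers. A firewall that allows ICMP from the local network but not from CERN would still pass this local check, so a positive result is necessary but not sufficient.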