
INFN-T1 site report

Andrea Chierici
On behalf of INFN-T1 staff

HEPiX Spring 2014


Outline

• Common services
• Network
• Farming
• Storage



Common services


Cooling problem in March

• A problem with the cooling system forced us to switch the whole center off
  – Naturally, the problem happened on a Sunday at 1 am
• It took almost a week to completely recover and bring the center 100% back on-line
  – But the LHC experiments were back on-line after 36 hours
• We learned a lot from this incident (see separate presentation)



New dashboard



Example: Facility



Installation and configuration

• CNAF is seriously evaluating a move to Puppet + Foreman as the common installation and configuration infrastructure (see the sketch after this list)
• INFN-T1 has historically been a Quattor supporter
• New man power, a wider user base and new activities are pushing us to change
• Quattor will stay around as long as needed
  – at least 1 year, to allow for the migration of some critical services
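As an illustration of what Foreman adds on top of a plain configuration tool, the sketch below lists the hosts a Foreman instance knows about through its JSON REST API. This is only a minimal example under assumptions: the URL, credentials and fields shown are hypothetical placeholders, not a description of the CNAF setup.

    # Minimal sketch: list hosts known to a Foreman instance via its JSON REST
    # API (v2). URL and credentials below are hypothetical examples.
    import requests

    FOREMAN_URL = "https://foreman.example.infn.it"   # hypothetical endpoint
    AUTH = ("apiuser", "secret")                      # hypothetical account

    resp = requests.get(FOREMAN_URL + "/api/v2/hosts", auth=AUTH, verify=True)
    resp.raise_for_status()

    # The v2 API wraps the host list in a "results" array.
    for host in resp.json()["results"]:
        print(host["name"], host.get("ip"), host.get("operatingsystem_name"))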


Heartbleed

• No evidence of compromised nodes
• Updated OpenSSL and certificates on bastion hosts and critical services (grid nodes, Indico, wiki); see the version check sketched after this list
• Some hosts were not exposed because they had an older OpenSSL version installed
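As a rough illustration of the kind of check involved (a minimal sketch, not the procedure actually used at CNAF), the snippet below reports which OpenSSL build a host's Python is linked against and flags versions in the Heartbleed-affected range 1.0.1 through 1.0.1f (CVE-2014-0160):

    # Minimal sketch: flag OpenSSL builds in the Heartbleed-affected range.
    # Note: only builds with the heartbeat extension enabled were exploitable.
    import re
    import ssl

    version = ssl.OPENSSL_VERSION        # e.g. "OpenSSL 1.0.1e-fips 11 Feb 2013"
    match = re.search(r"1\.0\.1([a-z]?)", version)

    if match and match.group(1) <= "f":
        # 1.0.1 .. 1.0.1f: update openssl, restart services, reissue certificates.
        print(version, "-> potentially vulnerable (CVE-2014-0160)")
    else:
        # Older branches (0.9.8, 1.0.0) never had the heartbeat bug, which is
        # why hosts with older versions installed were not exposed.
        print(version, "-> not in the affected 1.0.1-1.0.1f range")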



Grid Middleware status

• EMI-3 update status
  – All core services updated
  – All WNs updated
  – Some legacy services (mainly UIs) still at EMI-1/2, will be phased out as soon as possible



Network


WAN Connectivity

[Diagram] A Cisco Nexus + Cisco 7600 pair connects the T1 resources to:
• LHCOPN and LHCONE (RAL, PIC, TRIUMF, BNL, FNAL, TW-ASGC, NDGF, IN2P3, SARA) over a 40 Gb physical link (4x10 Gb) shared between LHCOPN and LHCONE
• General IP connectivity via GARR Bo1 at 10 Gb/s
• A dedicated 10 Gb/s CNAF-FNAL link for CDF (data preservation)



Current connection model

[Diagram] Core switches/routers (Cisco 7600, BD8810, Nexus 7018) connect the center to the Internet and to LHCOPN/ONE. Disk servers are attached at 10 Gb/s; farming switches with 20 worker nodes each are uplinked at 2x10 Gb/s (up to 4x10 Gb/s); older 2009-2010 resources sit behind farming switches with 4x1 Gb/s uplinks.

• Core switches and routers are fully redundant (power, CPU, fabrics)
• Every switch is connected with load sharing on different port modules
• Core switches and routers have a strict SLA (next solar day) for maintenance


Farming


Computing resources

• 150K HS06
  – Reduced compared to the last workshop
  – Old nodes (2008 and 2009 tenders) have been phased out
• Whole farm running on SL6
  – A few VOs that still require SL5 are supported via WNODeS



New CPU tender

• The 2014 tender has been delayed
  – Funding issues
  – We were running over-pledged resources
• Trying to take TCO (energy consumption) into account, not only the sales price
• Support will cover 4 years
• Trying to open the tender as much as possible
  – The last tender attracted only 2 bidders
  – “Relaxed” support constraints
• We would like an easy way to share specs, experience and hints about other sites' procurements



Monitoring & Accounting (1)



Monitoring & Accounting (2)



New activities (from the last workshop)

• Did not migrate to Grid Engine; we stick with LSF
  – Mainly an INFN-wide decision
  – Man power
• Testing Zabbix as a platform for monitoring computing resources
  – More time required
• Evaluating APEL as an alternative to DGAS as the grid accounting system: not done yet



New activities

• Configure an oVirt cluster to manage service VMs: done
  – a standard libvirt mini-cluster, with GPFS shared storage, is kept for backup
• Upgrade LSF to v9
• Set up a new HPC cluster (Nvidia GPUs + Intel MIC)
• Multicore task force
• Implement a log analysis system (Logstash, Kibana); see the sketch after this list
• Move some core grid services to an OpenStack infrastructure (the first one will be the site BDII)
• Evaluation of Avoton CPU (see separate presentation)
• Add more VOs to WNODeS
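For the log analysis item, the sketch below shows how a service or cron job could ship one structured event to the collector. It is a minimal sketch under assumptions: it presumes a Logstash TCP input with a JSON-lines codec, and the host, port and field names are hypothetical, not the actual CNAF pipeline.

    # Minimal sketch: send one JSON event to a Logstash TCP input configured
    # with a json_lines codec. Host, port and field names are hypothetical.
    import json
    import socket
    import time

    event = {
        "@timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "host": socket.gethostname(),
        "service": "example-service",
        "message": "test event from a worker node",
    }

    sock = socket.create_connection(("logstash.example.cnaf.infn.it", 5000))
    sock.sendall((json.dumps(event) + "\n").encode("utf-8"))
    sock.close()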



Storage


Storage Resources

• Disk space: 15 PB (net) on-line
  – 4 EMC2 CX3-80 + 1 EMC2 CX4-960 (~1.4 PB) + 80 servers (2x1 Gb/s connections)
  – 7 DDN S2A 9950 + 1 DDN SFA 10K + 1 DDN SFA 12K (~13.5 PB) + ~90 servers (10 Gb/s)
  – The upgrade of the latest system (DDN SFA 12K) was completed in Q1 2014; aggregate bandwidth: 70 GB/s
• Tape library SL8500: ~16 PB on-line, with 20 T10KB, 13 T10KC and 2 T10KD drives
  – 7500 x 1 TB tapes, ~100 MB/s of bandwidth per drive
  – 2000 x 5 TB tapes, ~200 MB/s of bandwidth per drive; these 2000 tapes can be “re-used” with the T10KD technology at 8.5 TB per tape
  – Drives are interconnected to the library and servers via a dedicated SAN (TAN); 13 Tivoli Storage Manager HSM nodes access the shared drives
  – 1 Tivoli Storage Manager (TSM) server common to all GEMSS instances
• A tender for an additional 3000 x 5 TB/8.5 TB tapes for 2014-2017 is ongoing
• All storage systems and disk servers are on SAN (4 Gb/s or 8 Gb/s)



Storage Configuration

• All disk space is partitioned into ~10 GPFS clusters served by ~170 servers
  – One cluster per main (LHC) experiment
  – GPFS deployed on the SAN implements a fully high-availability system
  – The system is scalable to tens of PB and able to serve thousands of concurrent processes with an aggregate bandwidth of tens of GB/s
• GPFS coupled with TSM offers a complete HSM solution: GEMSS
• Access to storage is granted through standard interfaces (POSIX, SRM, XRootD and WebDAV)
  – File systems are directly mounted on the WNs (see the sketch after this list)
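Because the GPFS file systems are mounted directly on the worker nodes, jobs can read experiment data with plain POSIX I/O, with no storage client library involved. A minimal sketch; the mount point and file name are hypothetical examples, not actual CNAF paths:

    # Minimal sketch: ordinary POSIX access to a GPFS-mounted area on a WN.
    # The path below is a hypothetical example.
    import os

    path = "/storage/gpfs_example/some_experiment/dataset/file.root"

    st = os.stat(path)                      # POSIX metadata call
    print("size: %d bytes" % st.st_size)

    with open(path, "rb") as f:             # POSIX read, as for any local file
        header = f.read(1024)
    print("read %d bytes" % len(header))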



Storage research activities

• Studies on more flexible and user-friendly methods for accessing storage over the WAN
  – Storage federations based on HTTP/WebDAV, for ATLAS (production) and LHCb (testing); see the sketch after this list
  – Evaluation of different file systems (Ceph) and storage solutions (EMC2 Isilon over OneFS)
• Integration between the GEMSS storage system and XRootD, to match the requirements of CMS, ATLAS, ALICE and LHCb via ad-hoc XRootD modifications
  – This is currently in production
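As an illustration of the HTTP/WebDAV access path, the sketch below does a partial read of a file through a federation-style endpoint. It is a minimal sketch: the URL, file path and proxy location are hypothetical, and real access authenticates with the experiment's X.509 VOMS proxy.

    # Minimal sketch: partial read of a file through an HTTP/WebDAV storage
    # endpoint. URL, file path and proxy location are hypothetical examples.
    import requests

    url = "https://webdav.example.cnaf.infn.it/atlas/some/dataset/file.root"
    proxy = "/tmp/x509up_u1000"             # VOMS proxy (cert + key in one file)

    resp = requests.get(
        url,
        cert=proxy,                                 # client authentication
        verify="/etc/grid-security/certificates",   # CA certificate directory
        headers={"Range": "bytes=0-1023"},          # read only the first KiB
    )
    resp.raise_for_status()
    print("HTTP %d, got %d bytes" % (resp.status_code, len(resp.content)))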



LTDP

• Long Term Data Preservation (LTDP) for the CDF experiment
  – The FNAL-CNAF data copy mechanism is completed
• The copy of the data will follow this timetable:
  – end 2013 / early 2014 → all data and MC user-level n-tuples (2.1 PB)
  – mid 2014 → all raw data (1.9 PB) + databases
• A bandwidth of 10 Gb/s is reserved on the transatlantic link CNAF ↔ FNAL (see the estimate after this list)
• 940 TB already at CNAF
• Code preservation: the CDF legacy software release (SL6) is under test
• Analysis framework: in the future, CDF services and analysis computing resources will possibly be instantiated on demand on pre-packaged VMs in a controlled environment
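A back-of-the-envelope estimate (not from the slides) of what the reserved 10 Gb/s link implies for the two copy steps, assuming the full bandwidth could be used continuously with no protocol overhead:

    # Minimal sketch: naive transfer-time estimate on the reserved 10 Gb/s link.
    def transfer_days(data_pb, link_gbps):
        bits = data_pb * 1e15 * 8                  # PB (decimal) -> bits
        seconds = bits / (link_gbps * 1e9)         # ideal, no overhead
        return seconds / 86400.0

    print("user-level n-tuples, 2.1 PB: ~%.0f days" % transfer_days(2.1, 10))  # ~19 days
    print("raw data, 1.9 PB: ~%.0f days" % transfer_days(1.9, 10))             # ~18 days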
