INFN-T1 site report
Andrea Chierici, on behalf of INFN-T1 staff
HEPiX Spring 2014
Andrea Chierici 2
Outline
• Common services
• Network
• Farming
• Storage
20/05/2013
Common services
Cooling problem in March
• A problem with the cooling system forced us to switch the whole center off
– Obviously, the problem happened on a Sunday at 1 am
• It took almost a week to fully recover and bring the center 100% back on-line
– But the LHC experiments were reopened after 36 h
• We learned a lot from this (see separate presentation)
New dashboard
Example: Facility
Installation and configuration
• CNAF is seriously evaluating a move to Puppet + Foreman as the common installation and configuration infrastructure
• INFN-T1 has historically been a Quattor supporter
• New manpower, a wider user base and new activities are pushing us to change
• Quattor will stay around as long as needed
– At least 1 year, to allow for the migration of some critical services
Heartbleed
• No evidence of compromised nodes
• Updated OpenSSL and certificates on bastion hosts and critical services (grid nodes, Indico, wiki)
• Some hosts were not exposed, thanks to the older OpenSSL version installed
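The last point works because Heartbleed (CVE-2014-0160) only affects the OpenSSL 1.0.1 branch up to 1.0.1f; the older 0.9.8 and 1.0.0 branches never shipped the vulnerable heartbeat code. A minimal sketch of the version check a site might script during such a campaign (the function name is illustrative, not CNAF's actual tooling):

```python
import re

def heartbleed_vulnerable(version: str) -> bool:
    """True if an OpenSSL version string falls in the Heartbleed-affected
    range 1.0.1 .. 1.0.1f (CVE-2014-0160)."""
    m = re.match(r"1\.0\.1([a-z]?)$", version.strip())
    if not m:
        # 0.9.8 and 1.0.0 branches (and 1.0.2+) are not affected
        return False
    # 1.0.1 with no letter, and letters a..f, are vulnerable; 1.0.1g is fixed
    return m.group(1) <= "f"
```

Hosts still on a 0.9.8-era SL5 build would report `False` here, matching the "not exposed due to older version" observation.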
Grid Middleware status
• EMI-3 update status
– All core services updated
– All WNs updated
– Some legacy services (mainly UIs) still at EMI-1/2; they will be phased out as soon as possible
Network
WAN Connectivity
[Diagram: WAN connectivity. The CNAF core (Nexus + Cisco 7600) connects the T1 resources to the LHCOPN (RAL, PIC, TRIUMF, BNL, FNAL, TW-ASGC, NDGF, IN2P3, SARA) and to LHCONE.]
• 40 Gb/s physical link (4x10 Gb/s), shared between LHCOPN and LHCONE
• 10 Gb/s to GARR Bo1 for general IP connectivity
• Dedicated 10 Gb/s CNAF-FNAL link for CDF (data preservation)
Current connection model
[Diagram: Internet and LHCOPN/ONE traffic reaches the core (Cisco 7600, BD8810, Nexus 7018) at 10 Gb/s. Disk servers attach at 10 Gb/s. New farming switches serve 20 worker nodes each and uplink at 2x10 Gb/s (up to 4x10 Gb/s); old 2009-2010 resources sit behind farming switches with 4x1 Gb/s uplinks.]
• Core switches and routers are fully redundant (power, CPU, fabrics)
• Every switch is connected with load sharing across different port modules
• Core switches and routers have a strict SLA (next solar day) for maintenance
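As a quick sanity check on the uplink sizing above, assuming 1 Gb/s NICs on the worker nodes (the per-node speed is not stated on the slide), the oversubscription ratio of a rack switch is just node demand over uplink capacity:

```python
def uplink_oversubscription(n_nodes, node_gbps, n_uplinks, uplink_gbps):
    """Ratio of worst-case aggregate node demand to switch uplink capacity."""
    return (n_nodes * node_gbps) / (n_uplinks * uplink_gbps)

# Old 2009-2010 racks: 20 WNs behind a 4x1 Gb/s trunk -> 5:1 oversubscribed
old = uplink_oversubscription(20, 1, 4, 1)

# Current racks: 20 WNs behind 2x10 Gb/s -> 1:1, i.e. non-blocking at 1 GbE
new = uplink_oversubscription(20, 1, 2, 10)
```

Under that assumption, the move to 2x10 Gb/s uplinks makes the farming switches non-blocking even before the optional upgrade to 4x10 Gb/s.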
Farming
Computing resources
• 150K HS06
– Reduced compared to the last workshop
– Old nodes (2008 and 2009 tenders) have been phased out
• Whole farm running on SL6
– A few VOs that still require SL5 are supported via WNoDeS
New CPU tender
• 2014 tender delayed
– Funding issues
– We were running over-pledged resources
• Trying to take TCO (energy consumption) into account, not only the sales price
• Support will cover 4 years
• Trying to open the tender as much as possible
– The last tender attracted only 2 bidders
– “Relaxed” support constraints
• We would like an easy way to share specs, experiences and hints about other sites' procurements
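The TCO idea above amounts to folding the energy bill over the 4-year support period into the bid comparison. A minimal sketch, with purely illustrative numbers (power draw, electricity price and PUE are assumptions, not figures from the slide):

```python
HOURS_PER_YEAR = 24 * 365

def tco_eur(purchase_eur, node_watts, years, eur_per_kwh, pue=1.5):
    """Purchase price plus energy cost over the support period.
    PUE folds in the cooling/power-distribution overhead of the center."""
    kwh = node_watts / 1000.0 * HOURS_PER_YEAR * years * pue
    return purchase_eur + kwh * eur_per_kwh

# Illustrative bid: a 400 W node at 5000 EUR, 4-year support, 0.15 EUR/kWh
example = tco_eur(5000, 400, 4, 0.15)
```

Even with these round numbers the energy term adds a few thousand euros per node, which is why a cheaper but power-hungry bid can lose on TCO.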
Monitoring & Accounting (1)
Monitoring & Accounting (2)
New activities (from last workshop)
• Did not migrate to Grid Engine; we are sticking with LSF
– Mainly an INFN-wide decision
– Manpower
• Testing Zabbix as a monitoring platform for computing resources
– More time required
• Evaluating APEL as an alternative to DGAS for grid accounting: not done yet
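Zabbix exposes a JSON-RPC API that monitoring scripts can drive directly, which is one reason it is attractive as a platform. A minimal sketch of building such a request (the credentials are placeholders, and this only constructs the body; a real script would POST it to the server's `api_jsonrpc.php` endpoint):

```python
import json

def zabbix_request(method, params, auth=None, req_id=1):
    """Build a Zabbix JSON-RPC 2.0 request body as a JSON string."""
    body = {"jsonrpc": "2.0", "method": method, "params": params, "id": req_id}
    if auth is not None:
        body["auth"] = auth  # session token returned by user.login
    return json.dumps(body)

# First call of any session: authenticate (placeholder credentials)
login = zabbix_request("user.login", {"user": "api_user", "password": "secret"})
```

The token returned by `user.login` is then passed as `auth` to subsequent calls such as `host.get` to enumerate monitored nodes.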
New activities
• Configure an oVirt cluster to manage service VMs: done
– Standard libvirt mini-cluster for backup, with GPFS shared storage
• Upgrade LSF to v9
• Set up a new HPC cluster (Nvidia GPUs + Intel MIC)
• Multicore task force
• Implement a log analysis system (Logstash, Kibana)
• Move some core grid services to an OpenStack infrastructure (the first one will be the site BDII)
• Evaluation of the Avoton CPU (see separate presentation)
• Add more VOs to WNoDeS
Storage
Storage Resources
• Disk space: 15 PB net on-line
– 4 EMC CX3-80 + 1 EMC CX4-960 (~1.4 PB) + 80 servers (2x1 Gb/s connections each)
– 7 DDN S2A 9950 + 1 DDN SFA 10K + 1 DDN SFA 12K (~13.5 PB) + ~90 servers (10 Gb/s)
– Upgrade of the latest system (DDN SFA 12K) was completed in Q1 2014; aggregate bandwidth: 70 GB/s
• Tape library SL8500: ~16 PB on-line, with 20 T10KB, 13 T10KC and 2 T10KD drives
– 7500 x 1 TB tapes, ~100 MB/s of bandwidth per drive
– 2000 x 5 TB tapes, ~200 MB/s of bandwidth per drive; these 2000 tapes can be “re-used” with the T10KD technology at 8.5 TB per tape
– Drives are interconnected to the library and servers via a dedicated SAN (TAN); 13 Tivoli Storage Manager HSM nodes access the shared drives
– 1 Tivoli Storage Manager (TSM) server common to all GEMSS instances
• A tender for an additional 3000 x 5 TB/8.5 TB tapes for 2014-2017 is ongoing
• All storage systems and disk-servers are on SAN (4 Gb/s or 8 Gb/s)
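As a quick check on the media figures above (raw cartridge capacity, which can slightly exceed the quoted ~16 PB on-line if not every slot is populated or full), and on what the T10KD "re-use" buys:

```python
def library_capacity_pb(cartridges):
    """Total raw media capacity in PB from a dict {capacity_tb: count}."""
    return sum(tb * n for tb, n in cartridges.items()) / 1000.0

# Current media mix: 7500 x 1 TB (T10KB) + 2000 x 5 TB (T10KC)
current = library_capacity_pb({1: 7500, 5: 2000})

# Same 2000 cartridges rewritten at T10KD density (8.5 TB each)
reused = library_capacity_pb({1: 7500, 8.5: 2000})
```

Re-using the existing 5 TB cartridges at T10KD density adds 7 PB without buying new media, on top of whatever the 3000-tape tender delivers.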
Storage Configuration
• All disk space is partitioned into ~10 GPFS clusters served by ~170 servers
– One cluster per main (LHC) experiment
– GPFS deployed on the SAN implements a fully highly-available system
– The system scales to tens of PB and serves thousands of concurrent processes with an aggregate bandwidth of tens of GB/s
• GPFS coupled with TSM offers a complete HSM solution: GEMSS
• Access to storage is granted through standard interfaces (POSIX, SRM, XRootD and WebDAV)
– Filesystems are directly mounted on the WNs
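Because the GPFS filesystems are mounted directly on the WNs, jobs see them through ordinary POSIX calls. A small sketch of what that looks like in practice (the `/gpfs/atlas` mount point below is hypothetical, chosen only for illustration):

```python
import os

def fs_free_tb(path):
    """Free space in TB on the filesystem holding `path`, via POSIX statvfs,
    exactly as a job on a WN would query a locally mounted GPFS filesystem."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize / 1e12

# On a WN this might be: fs_free_tb("/gpfs/atlas")  # hypothetical mount point
```

No special client library is needed; this is the practical payoff of the "FS directly mounted on WNs" design.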
Storage research activities
• Studying more flexible and user-friendly methods for accessing storage over the WAN
– Storage federations based on HTTP/WebDAV, for ATLAS (production) and LHCb (testing)
– Evaluation of different file systems (Ceph) and storage solutions (EMC Isilon over OneFS)
• Integration between the GEMSS storage system and XRootD, to match the requirements of CMS, ATLAS, ALICE and LHCb using ad-hoc XRootD modifications
– This is currently in production
LTDP
• Long Term Data Preservation (LTDP) for the CDF experiment
– The FNAL-CNAF data copy mechanism is complete
• The copy of the data will follow this timetable:
– End 2013 - early 2014 → all data and MC user-level n-tuples (2.1 PB)
– Mid 2014 → all raw data (1.9 PB) + databases
• Bandwidth of 10 Gb/s reserved on the transatlantic link CNAF ↔ FNAL
• 940 TB already at CNAF
• Code preservation: CDF legacy software release (SL6) under test
• Analysis framework: in the future, CDF services and analysis computing resources may be instantiated on demand on pre-packaged VMs in a controlled environment
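A back-of-the-envelope check on the timetable: at the reserved 10 Gb/s, the 1.9 PB of raw data alone needs a bit under three weeks of continuous transfer at full rate (assuming decimal petabytes and ignoring protocol overhead), which is consistent with scheduling that copy over several months:

```python
def transfer_days(data_pb, link_gbps, efficiency=1.0):
    """Days to move data_pb petabytes over a link_gbps link at the
    given utilization (1.0 = full line rate, decimal units throughout)."""
    bits = data_pb * 1e15 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86400.0

raw_data = transfer_days(1.9, 10)        # ~17.6 days at full line rate
realistic = transfer_days(1.9, 10, 0.5)  # ~35 days at 50% utilization
```

The same function puts the already-transferred 940 TB at under 9 days of line-rate transfer.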