Quattor in CMS (a CMS for CMS). J.A. Coarasa, CERN, Geneva, Switzerland, for the CMS TriDAS group. 11th Quattor Workshop, 16-18 March 2011, CERN, Geneva

Quattor in CMS (a CMS for CMS)

J.A. Coarasa CERN, Geneva, Switzerland

for the CMS TriDAS group.

11th Quattor Workshop, 16-18 March 2011, CERN, Geneva

Outline

• The environment:
  – CMS
  – the CMS Online Cluster
• The Quattor installation:
  – Infrastructure
  – Anatomy of a profile
  – The tools around Quattor (some examples):
    • The template summarizer
    • The software updater
    • The tools for the cluster
• Summary

CMS design parameters

Detector    Channels    Ev. Data (kB)
Pixel       60000000    50
Tracker     10000000    650
Preshower   145000      50
ECAL        85000       100
HCAL        14000       50
Muon DT     200000      10
Muon RPC    200000      5
Muon CSC    400000      90

Requirements and implications: General

The IT infrastructure (computing and networking) of the CMS Online Cluster is responsible for the CMS data acquisition and experiment control. The requirements were:

• Autonomous (i.e. independent from all other networks, CERN campus network included), uninterrupted 24/7 operation at two physical locations far apart (~200 m), with one Control Room;
  ⇒ All IT infrastructure and services must be local and redundant.
  ⇒ Strict security is implied.
• Remote control and monitoring of computers is necessary.
• Fast configuration turnaround required due to evolving nature of applications during commissioning phase;
• Scalable services design to accommodate future expansions;
• Serving the needs of a community of more than 900 Users.
  ⇒ Some level of user configuration is required/mandatory.

CMS DAQ challenge

• Unprecedented Data Volumes
  – Reduction 1 in 100 000 online
  – Level 1 Maximum trigger rate 100 kHz; event size 1 MByte
  – Event Building 1 Terabit/s, ~700 sources, 2000 destinations
  – High Level Trigger (~offline SW): selectivity 1 in 1000

• Strategy: invest in commercial processing and network technologies (TP 1994)

[Figure: DAQ block diagram. Block labels: Detector Front-end, Computing Services, Readout Systems, Builder and Filter Systems, Event Manager, Builder Networks, Level 1 Trigger, Run Control. Rate labels: 40 MHz, 100 kHz, 100 Hz.]

Requirements and implications: DAQ Readout, Event building, HLT, Storage

• Data Network capable of reading from electronics at 100 kHz (~100 GBytes/s);
  ⇒ Computers reading from electronics need Myrinet networking.
• Data Network and sufficient computing power to run the second (high) level trigger software to select a maximum of 2 GBytes/s from the 100 GBytes/s;
  ⇒ At least 2500 cores (2.5 GHz).
  ⇒ High bandwidth networking.
• Enough local storage to operate for 2 days without connection to Tier 0;
  ⇒ At least 300 TBytes of local storage.
• Capacity of transferring at most 2 GBytes/s to “Central Data Center” (Tier 0);
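The figures quoted for the trigger rate, event size and local storage can be cross-checked with a few lines of arithmetic. The sketch below assumes decimal units (1 MByte = 10^6 bytes, 1 TByte = 10^12 bytes); the variable names are illustrative only and do not come from any CMS software.

```python
# Back-of-the-envelope check of the DAQ numbers quoted on the slides.
# Assumption: decimal units (1 MByte = 1e6 bytes, 1 TByte = 1e12 bytes).

L1_RATE_HZ = 100_000              # Level 1 maximum trigger rate (100 kHz)
EVENT_SIZE_BYTES = 1_000_000      # event size (1 MByte)
LOCAL_STORAGE_BYTES = 300e12      # local storage (300 TBytes)
AUTONOMY_SECONDS = 2 * 24 * 3600  # 2 days without connection to Tier 0

# Event-building throughput: trigger rate times event size.
builder_bytes_per_s = L1_RATE_HZ * EVENT_SIZE_BYTES
builder_gbytes_per_s = builder_bytes_per_s / 1e9
builder_tbit_per_s = builder_bytes_per_s * 8 / 1e12

# Average output rate that fills the local storage in exactly 2 days.
sustainable_gbytes_per_s = LOCAL_STORAGE_BYTES / AUTONOMY_SECONDS / 1e9

print(f"Event building: {builder_gbytes_per_s:.0f} GBytes/s = {builder_tbit_per_s:.1f} Tbit/s")
print(f"300 TBytes last 2 days at {sustainable_gbytes_per_s:.2f} GBytes/s on average")
```

The product reproduces the ~100 GBytes/s readout figure and is of the order of the 1 Terabit/s quoted for event building. The storage figure corresponds to an average output rate a little below the 2 GBytes/s maximum.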

Constraints and implications

• Limited manpower (~5 FTE):
  ⇒ Automatic procedures where possible.
• Harsh operational conditions:
  – Unexpected cooling failures and power cuts;
  – UPS protects only the central servers;
  ⇒ Automatic shutdown with rising temperature.
• Computers connected to the electronics have to be swiftly replaced in case of failure (other computers, like HLT, run fault-tolerant software…);
  ⇒ Need for spares in location.
  ⇒ Fast turnaround in reinstallation and/or reconfiguration.

The Solution: IT infrastructure. Computing

• More than 2500 computers mostly running Scientific Linux CERN 4/5:
  – 640 to read out from the electronics, equipped with 2 Myrinet and 3 independent 1 Gbit Ethernet lines for data networking;
  – 720 with 8 CPU cores (5760) + 288 (being commissioned) with 12 CPU cores (3456) as high level trigger computers, with 2 Gbit Ethernet lines for data networking;
  – 16 with access to 300 TBytes of FC storage, 4 Gbit Ethernet lines for data networking and 2 additional ones for networking to Tier 0;
  – More than 350 used by subdetectors and to control electronics (includes 90 Windows and miscellaneous hardware);
  – 12 as an ORACLE RAC;
  – 15 as CMS control computers;
  – 50 as desktop computers in the control rooms;
  – 200 for commissioning and testing in a partly replicated setup;
  – 20 as infrastructure and access servers;
  – More than 200 active spare computers.
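As a quick consistency check, the per-category counts listed on this slide can be added up. The sketch below uses the lower bound wherever the slide says "more than"; the dictionary keys are shorthand labels chosen here, not official names.

```python
# Computer counts as listed on the slide.
# Assumption: "more than N" entries are counted as exactly N (lower bound).
inventory = {
    "readout": 640,
    "HLT, 8 cores": 720,
    "HLT, 12 cores (being commissioned)": 288,
    "storage": 16,
    "subdetectors and electronics control": 350,
    "ORACLE RAC": 12,
    "CMS control": 15,
    "control room desktops": 50,
    "commissioning and testing": 200,
    "infrastructure and access servers": 20,
    "active spares": 200,
}

total_computers = sum(inventory.values())

# High level trigger cores: 720 machines with 8 cores plus 288 with 12 cores.
hlt_cores = 720 * 8 + 288 * 12

print(f"Total computers (lower bound): {total_computers}")
print(f"HLT cores: {hlt_cores}")
```

The lower-bound total agrees with the "more than 2500 computers" headline, and the two core counts in parentheses on the slide (5760 and 3456) add up to the HLT total computed here.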

The Solution: IT infrastructure. Networking

• 14 Myrinet switches
• 9 Force 10 E1200 high concentration (up to 1260 ports) 1 Gbit Ethernet switches
• ~100 1 Gbit Ethernet switches
• CMS Networks:
  – Public CERN Network (GPN);
  – Private Networks:
    • Service Networks;
    • Data Network
      – Source routing on computers
      – VLANs on switches
    • Central Data Recording (CDR). Network to Tier 0.
• NetApp Network Attached Storage filer

[Figure: CMS Networks. Labels: CERN Network; Computer gateways; Data Network; Readout, HLT; Storage Manager; CDR Network; Control…; Service Network; Firewall; Internet.]

Redundancy: The Network Attached Storage.

• Our important data is hosted in a Network Attached Storage:
  – User home directories;
  – CMS data: calibration, data quality monitoring…;
  – Repositories and configuration management data;
  – Admin data.
• 2 NetApp filer heads in failover configuration (in two racks);
• With 3 mirrored storage drawers (6 in total) with internal Dual Parity RAID 6;
• And the snapshot feature active (saves us from going to Backup).
• Tested throughput > 380 MBytes/s.

[Figure: redundant NAS. Link labels: 4 Gbit, 2x10 Gbit.]

Redundancy and Load balancing for the CMS IT Structural Services. The Concept

• Pattern to provide redundancy:
  – 1 master + N slave/replicas (now N=3 for most of the services) hosted in different racks;
    ⇒ Easy scalability.
    ⇒ Needs replication for all services.
  – Services working under DNS alias where possible.
    ⇒ Allows the service to be moved.
    ⇒ No service outage.
  – Load balancing of primary server for client:
    • DNS Round Robin;
    • explicit client configuration segregating in groups of computers.

[Figure: 1 master, N slave/replicas.]
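One way to picture the "explicit client configuration segregating in groups of computers" is to give every client a fixed primary replica and keep the others as fallbacks. The slides do not say how the groups are formed; the sketch below assumes a stable hash of the host name, and the replica names are hypothetical.

```python
import hashlib


def replica_for(client, replicas):
    """Pick a primary replica for a client from a stable hash of its name."""
    digest = hashlib.sha256(client.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(replicas)
    return replicas[index]


def client_config(client, replicas):
    """Return the server list for a client: its primary first, then the fallbacks."""
    primary = replica_for(client, replicas)
    return [primary] + [r for r in replicas if r != primary]


# Hypothetical replica names (N=3 replicas, as on the slide).
replicas = ["replica-1", "replica-2", "replica-3"]

for host in ["ru-c2f02-14", "bufu-c2d16-27", "srv-c2d04-24"]:
    print(host, client_config(host, replicas))
```

Because the hash depends only on the host name, a client keeps the same primary across reinstallations, and the load spreads roughly evenly as the number of clients grows. DNS Round Robin achieves a similar spread without any per-client configuration.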

CMS Management and Configuration Infrastructure: The tools

• IPMI (Intelligent Platform Management Interface) is used to manage the computers remotely:
  – reboot, console access,…;
• PXE and anaconda kickstart through http are used as bootstrap installation method;
• Quattor (QUattor is an Administration ToolkiT for Optimizing Resources) is used as the configuration management system;
  ⇒ All Linux computers configured through it or rpms distributed with it (even the Quattor servers themselves): BIOS, all Networking parameters…

CMS Quattor infrastructure I

• Based on Quattor 1.3
• CDB (cdb 2.0.4, PANC 6.0.8) (16 GByte, 8 core)
  – cmscdb-01 + 2 standby: cmscdb-02, cmscdb-03 (active selected through a DNS alias)
  – CDB data hosted on filer
  ⇒ Nothing special on the active server. In a few minutes the active can be a different one.
• Repositories (swrep 2.1.38)
  – 6 computers, DNS Round Robin load balanced
  – swrep.cms: CERN IT’s offline copy of SLC4/SLC5
  – cmsswrep, with “zones”:
    • /cms_system
    • /cms_cdaq
    • /cms_rcms
    • /cms_cmssw
    • /cms_subdet
    • /cms_ecal
  – Repositories hosted on filer
    • At installation time the repository computers act as cache (the filer is not stressed)

CMS Quattor infrastructure II

• It uses in-house:
  – restricted format in templates: “hierarchical” + other conventions;
  – areas for cdb and swrep to define subdetector software and versioning in them;
  ⇒ Allowed easy in-house developments:
    • Template summarizer/“inventory maker” http://cmsdaq0.cern.ch/cmscdb
    • Dropbox for rpms
    • Template updater

CMS Quattor performance

• ~2500 computers managed in ~80 types
  – PANC compilation takes less than 4 min for 1050, 8 min for 2100… due to cdb.conf tuning:
    • 150 computers per process
    • 7 processes maximum configuration
    • 2.2 GBytes maximum memory taken per process
  – Notification active, spreading time less than 10 min depending on _type_
    • Limitation comes from the available bandwidth to the repository servers (currently 6 Gbit) and inability of spma to retry after http timeout
    ⇒ Cluster-wide reconfiguration takes more than 10 min
  – Reinstallation takes 6-22 minutes per computer
  – Reinstallation of 1000 computers in ≲ 1 ¼ hour
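The compilation figures fit a simple "wave" picture: 7 processes of 150 computers each cover 1050 profiles at a time. The sketch below is one reading of those numbers, not a description of how cdb actually schedules work; the 4 minutes per wave is an assumed upper bound taken from "less than 4 min for 1050".

```python
import math

COMPUTERS_PER_PROCESS = 150   # from the cdb.conf tuning on the slide
MAX_PROCESSES = 7             # from the cdb.conf tuning on the slide
MEMORY_PER_PROCESS_GB = 2.2   # maximum memory taken per process
MINUTES_PER_WAVE = 4          # assumption: upper bound for one full wave


def compile_waves(n_computers):
    """Number of full passes needed if each pass compiles 7 x 150 profiles."""
    per_wave = COMPUTERS_PER_PROCESS * MAX_PROCESSES
    return math.ceil(n_computers / per_wave)


def estimated_compile_minutes(n_computers):
    """Rough upper bound on compilation time under the wave assumption."""
    return compile_waves(n_computers) * MINUTES_PER_WAVE


# Peak memory when all processes run at their maximum size.
peak_memory_gb = MAX_PROCESSES * MEMORY_PER_PROCESS_GB

for n in (1050, 2100, 2500):
    print(n, "computers:", compile_waves(n), "wave(s), about",
          estimated_compile_minutes(n), "min at most")
print(f"Peak memory: {peak_memory_gb:.1f} GBytes")
```

Under this reading the peak memory stays just under the 16 GByte of the CDB server, which would explain why the limit is 7 processes.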

profile anatomy

_type_ anatomy

template pro_type_cdaq_bufu_slc5_x86_64;

"/system/cluster/name" = default( "cdaq_ruhardware_slc4_32" );
include cms_pro_software_slc5_x86_64;

variable loadpath = list("cms_cdaq_slc5_64","cms_cmssw_slc5_64");
include cms_cdaq_boost_pro_software_slc5_x86_64;
[…]
include cms_cdaq_filter_pro_software_slc4_32;
include cms_cmssw_general_pro_software_slc4_32;

include cms_pro_system;
include cms_pro_kernel_version_slc5_x86_64;

include cms_cmsnet_pro_system_slc5_x86_64;
include cms_cdaq_pro_system;

include cms_pro_system_acl;
include cms_cdaq_ruhardware_pro_system_acl;
include cms_autofs_hilton_pro_system;
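Because a _type_ template is essentially a flat list of include statements, it is easy to process with small scripts. The sketch below shows how such a template could be reduced to its name and include list; it is a hypothetical illustration, not the CMS template summarizer, and the regular expressions only cover the simple statement forms seen on this slide.

```python
import re

# Assumption: template and include names use only letters, digits and "_",
# as in the example on the slide.
TEMPLATE_RE = re.compile(r"\btemplate\s+([A-Za-z0-9_]+)\s*;")
INCLUDE_RE = re.compile(r"\binclude\s+([A-Za-z0-9_]+)\s*;")


def summarize_template(text):
    """Return the template name and the ordered list of included templates."""
    name = TEMPLATE_RE.search(text)
    return {
        "template": name.group(1) if name else None,
        "includes": INCLUDE_RE.findall(text),
    }


# A shortened copy of the template shown on the slide.
sample = '''
template pro_type_cdaq_bufu_slc5_x86_64;
"/system/cluster/name" = default( "cdaq_ruhardware_slc4_32" );
include cms_pro_software_slc5_x86_64;
include cms_pro_system;
include cms_pro_kernel_version_slc5_x86_64;
'''

summary = summarize_template(sample)
print(summary["template"])
for included in summary["includes"]:
    print("  includes", included)
```

A restricted, conventional template format is what makes this kind of text processing reliable; free-form Pan code would need a real parser.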

Workarounds for Quattor Limitations in our setup

• Sometimes the components are buggy (especially in removal of configuration)
• Sometimes they do not exist

Examples of workarounds
• Copy configuration files to individual computers (copyd)
• Data networks configuration done through in-house development
• rpms are used
  – To do some configuration (SLP…)
  – To create local users

The Template Summarizer (Functionality)

– Gives us an up-to-date, customized overview of the cluster:

The Software Updater Tools (Functionality)

• They can update (up- or downgrade) an existing RPM on a group of computers.

• They can roll back the changes you made.

• They let you check the status of the update on your computers.

The Software Updater Tools (Limitations)

• Only one person can run it at a time. You will be told who is running it.

• You cannot add an rpm that was never in your templates. For this, you still need to create a savannah request.

• The configuration of Quattor has to be set to non-permissive.
  – You can still install rpms manually.

The Software Updater Tools (Howto)

• Log in to the cmsdropbox computer

• Create a directory in your working directory with the name of your area

• Copy your rpms into this directory

• Run the software updater and follow the instructions

• Wait a bit (currently ~10 min) and run the command to check the status

What the Updater Does: First Step

• Uploads the RPMs in this directory to the Quattor software repository
  – First it performs checks and
    • Aborts if:
      – RPMs not named properly;
      – RPMs already in the repository and different;
      – 2 versions of the same RPM are given.
    • Warns if:
      – RPMs already in repository but equal.
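The abort and warn conditions listed for the first step can be expressed as a small validation function. This is a sketch of the logic only, not the actual updater script: the name-version-release.arch.rpm pattern, the use of checksums to decide whether two files are "different", and all function names are assumptions made for illustration.

```python
import re

# Assumption: RPM files follow the usual name-version-release.arch.rpm scheme.
RPM_NAME = re.compile(
    r"^(?P<name>.+)-(?P<version>[^-]+)-(?P<release>[^-]+)\.(?P<arch>[^.]+)\.rpm$"
)


def check_rpms(candidates, repository):
    """Check RPMs before upload.

    candidates and repository map file names to checksums (hypothetical
    representation). Returns (errors, warnings); any error means abort.
    """
    errors, warnings = [], []
    seen = {}  # (package name, arch) -> first file name given for it
    for filename, checksum in sorted(candidates.items()):
        match = RPM_NAME.match(filename)
        if match is None:
            errors.append(f"{filename}: not named properly")
            continue
        key = (match["name"], match["arch"])
        if key in seen:
            errors.append(f"{filename}: second version of {match['name']} "
                          f"(already given {seen[key]})")
        else:
            seen[key] = filename
        if filename in repository:
            if repository[filename] != checksum:
                errors.append(f"{filename}: already in the repository and different")
            else:
                warnings.append(f"{filename}: already in the repository but equal")
    return errors, warnings


candidates = {
    "tool-1.0.0-1.noarch.rpm": "aaa",
    "tool-1.0.1-1.noarch.rpm": "bbb",
    "badname.rpm": "ccc",
    "lib-2.0-3.x86_64.rpm": "ddd",
}
repository = {"lib-2.0-3.x86_64.rpm": "ddd"}

errors, warnings = check_rpms(candidates, repository)
for line in errors:
    print("ABORT:", line)
for line in warnings:
    print("WARN:", line)
```

Running all checks before uploading anything means a bad batch leaves the repository untouched, which matches the abort-first behaviour described on the slide.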

What the updater does: Second Step

• Edits the templates to substitute the RPMs you gave
  – It will warn if an RPM is not found in the templates
• If allowed, starts the update on the computers and tells you how to roll back the change
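The template edit in this step amounts to replacing the version of a package that is already listed. The sketch below assumes the software templates declare packages with lines of the form pkg_add("name","version-release","arch") or pkg_repl(...); the slides do not show the CMS software templates, so both that format and the function name substitute_rpm are assumptions.

```python
import re


def substitute_rpm(template_text, name, new_version_release):
    """Replace the version-release of one package in a template.

    Returns (new_text, found). found is False when the package is not in
    the template, which is the case the updater warns about.
    """
    pattern = re.compile(
        r'(pkg_(?:add|repl)\(\s*"' + re.escape(name) + r'"\s*,\s*")([^"]+)(")'
    )
    new_text, count = pattern.subn(
        lambda m: m.group(1) + new_version_release + m.group(3),
        template_text,
    )
    return new_text, count > 0


# Hypothetical template line; the old version number is made up.
old_text = '"/software/packages" = pkg_add("RemoteDropboxAndUpdate","0.9.0-1","noarch");'

new_text, found = substitute_rpm(old_text, "RemoteDropboxAndUpdate", "1.0.0-1")
print(new_text)

_, found_missing = substitute_rpm(old_text, "NeverInTemplates", "1.0.0-1")
if not found_missing:
    print("WARNING: NeverInTemplates not found in the templates")
```

Keeping a copy of the unmodified template text before the substitution is enough to support a rollback, since restoring that copy undoes the change.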

What the updater does: Third Step

• Tells you how to get the status of the computers affected by the update.
  – You will have to wait the “distribution time of updates” (currently 10 minutes)

An Example session (I)

[coarasa@srv-C2C04-15 ~]$ cp ~coarasa/TriDAS/sysadmin/RemoteDropboxAndUpdate/RemoteDropboxAndUpdate-1.0.0-1.noarch.rpm cms_general/

[coarasa@srv-C2C04-15 ~]$ /usr/local/bin/UpdateTemplatesWithRPMs.sh cms_general

25043 INFO: Creating /tmp/UpdateTemplatesWithRPMs.sh-2009-12-07_17:06:00-coarasa-Backup to backup profiles. Use these ones to go back on your change

25043 INFO: I found the following RPMs to update: cms_general/RemoteDropboxAndUpdate-1.0.0-1.noarch.rpm

25043 INFO: Is it correct? Are you sure you want to continue? [no]

y

25043 INFO: 1 Files that will be uploaded: cms_general/RemoteDropboxAndUpdate-1.0.0-1.noarch.rpm

25043 INFO: 1 Successfuly uploaded files (deleted from incoming area): cms_general/RemoteDropboxAndUpdate-1.0.0-1.noarch.rpm

25043 INFO: Continuing to install the following RPMs in computers.

25043 INFO: List of RPMS: cms_general/RemoteDropboxAndUpdate-1.0.0-1.noarch.rpm

25043 INFO: Do you want to install them now? Are you sure you want to continue? [no]

y

25043 INFO: Found the following files on area cms_general: cms_general/RemoteDropboxAndUpdate-1.0.0-1.noarch.rpm

25043 INFO: Modified the following RPMs versions: cms_general/RemoteDropboxAndUpdate-1.0.0-1.noarch.rpm

25043 INFO: I will modify the following templates: cms_general/cms_dropbox_pro_software_slc4_32.tpl

25043 INFO: Are you sure you want to continue? [no]

y

25043 INFO: To rollback to the previous version use the command (one single line):

cd /tmp/UpdateTemplatesWithRPMs.sh-2009-12-07_17:06:00-coarasa-Backup ; sudo /usr/local/bin/cdbop_batch.sh cms_general "update cms_general/cms_dropbox_pro_software_slc4_32.tpl ;commit "

25043 INFO: Modified the following templates: cms_general/cms_dropbox_pro_software_slc4_32.tpl

25043 INFO: Issue the following command to check the computers afected by the update.

25043 INFO: The command will have to wait around 1200s to Check the computers.

TimeOfLastRunInUnixTime=1260202105 sudo /usr/local/bin/Check_spma_ForComputersOnTemplate.sh cms_general/cms_dropbox_pro_software_slc4_32.tpl

1st Step

2nd Step

An Example session (II)

[coarasa@srv-C2C04-15 ~]$ TimeOfLastRunInUnixTime=1260202105 sudo /usr/local/bin/Check_spma_ForComputersOnTemplate.sh cms_general/cms_dropbox_pro_software_slc4_32.tpl

25904 INFO: Writing output in /tmp/Check_spma_ForComputersOnTemplate.sh2009-12-07_17:11:26/output.log

25904 INFO: Update did not finished yet everywhere. Sleeping 1136 s

25904 INFO: The update affected the following computers: srv-c2c03-30 srv-c2c04-15 srv-c2c04-30 srv-c2c05-30 srv-s2d16-30

25904 INFO: All computers seem to be ok.

3rd Step

[root@srv-C2C04-15 ~]# TimeOfLastRunInUnixTime=1259864892 /usr/local/bin/Check_spma_ForComputersOnTemplate.sh cms_cmssw/cms_cmssw_general_pro_software_slc4_32.tpl

926 INFO: Writing output in /tmp/Check_spma_ForComputersOnTemplate.sh2009-12-07_18:15:45/output.log

926 INFO: The update affected the following computers: bufu-c2a12-01 […] vmepcs2g19-16

926 INFO: The following computers could not be pinged to know the status: bufu-c2d16-27 bufu-c2e12-23 bufu-c2f13-20 fuval-c2a11-21 fuval-c2a11-08 fuval-c2a11-23 fuval-c2a11-11 fuval-c2a11-18 fuval-c2a11-10 fuval-c2a11-03 fuval-c2a11-05 fuval-c2a11-09 fuval-c2a11-19 fuval-c2a11-07 fuval-c2a11-16 fuval-c2a11-12 fuval-c2a11-24 fuval-c2a11-25 fuval-c2a11-17 fuval-c2a11-04 fuval-c2a11-06 fuval-c2a11-26 fuval-c2a11-22 fuval-c2a11-14 fuval-c2a11-27 fuval-c2a11-15 fuval-c2a11-30 fuval-c2a11-20 fuval-c2a11-29 fuval-c2a11-28 fuval-c2a11-13 ru-c2a05-14 ru-c2a08-12 ru-c2f03-10 srv-c2d05-18 vmepcs2b17-01

926 INFO: The following computers have an undefine state (Was the update applied?): dvbufu-c2f36-21 fuval-c2f12-19 ru-c2e06-08

926 INFO: The following computers did not catch the update now: dcspcs2g19-36 fuval-c2f12-02 ecal-laser-room-04 fuval-c2f12-05 fuval-c2f12-03 fuval-c2f12-01 fuval-c2f12-04 vmepcs2b17-33 vmepcs1d12-18 vmepcs2f17-14 srv-c2d17-04 vmepcs2b17-30 vmepcs2b18-05 vmepcs2b17-11 srv-s2f19-26 vmepcs2b17-08 srv-c2d17-17 srv-c2d17-16 vmepcs2b17-24 vmepcs2b17-06 vmepcs2f17-11 vmepcs2b17-05 vmepcs2b19-01 vmepcs1d12-17 srv-c2d17-15 srv-c2d05-01 […] vmepcs2g17-01

926 INFO: The following computers did not return from the ssh in time: ru-c2f02-14 ru-c2e03-13 ru-c2e05-01 ru-c2e02-08 ru-c2f01-05 ru-c2f07-16 ru-c2f06-06 srv-c2d04-24 vmepcs2b18-23

926 INFO: Check the file /tmp/Check_spma_ForComputersOnTemplate.sh2009-12-07_18:15:45/output.log for details or /var/log/Check_spma_ForComputersOnTemplate.sh.log for even more details

Checking in Detail

3rd Step

The Tools for the cluster (Functionality)

• Allow you to run commands simultaneously on computers including a template and get a summary
  – Shutdown
  – Reboot
  – Power on
  – General commands:

CommandToExecuteRemotely="ls /tmp/finish_patch.log" \;
MatchingRegExp='^(ru|bufu)' \;
QuattorTemplatesToCheck="pro_type_tracker_hardware_slc4_32" \;
ExecuteCommandRemotelyInComputers.sh –all
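The general-command mode combines three ideas: select hosts by regular expression, run the command on all of them in parallel, and group the hosts by outcome. The sketch below illustrates that pattern with a stand-in runner instead of real ssh calls; it is not the ExecuteCommandRemotelyInComputers.sh script, and the function names and the fake results are hypothetical.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def select_hosts(hosts, matching_regexp):
    """Keep the hosts whose name matches the regular expression."""
    pattern = re.compile(matching_regexp)
    return [host for host in hosts if pattern.search(host)]


def run_on_hosts(hosts, command, runner, max_workers=8):
    """Run a command on all hosts in parallel and group hosts by status."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        statuses = list(pool.map(lambda host: runner(host, command), hosts))
    summary = {}
    for host, status in zip(hosts, statuses):
        summary.setdefault(status, []).append(host)
    return summary


def fake_runner(host, command):
    """Stand-in for a remote execution; a real runner would use ssh."""
    return "timeout" if host.startswith("bufu") else "ok"


hosts = ["ru-c2f02-14", "bufu-c2d16-27", "srv-c2d04-24",
         "vmepcs2b18-23", "ru-c2e03-13"]

selected = select_hosts(hosts, r"^(ru|bufu)")
summary = run_on_hosts(selected, "ls /tmp/finish_patch.log", fake_runner)

for status, names in summary.items():
    print(f"{status}: {' '.join(names)}")
```

Passing the runner as a parameter keeps the selection and summary logic testable without touching any real machine.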

Conclusions

• Quattor has been and is a scalable CMS for the CMS online Cluster (now 2500 computers)

• The quality and robustness of the configuration components is not always what you would like

• You can always resort to rpms to do some of the configuration

• The use of tools may greatly improve the Quattor experience

Thank you.