Fabric Management with ELFms BARC-CERN collaboration meeting B.A.R.C. Mumbai 28/10/05 Presented by G. Cancio – CERN/IT


Fabric Management with ELFms

BARC-CERN collaboration meeting

B.A.R.C. Mumbai

28/10/05

Presented by G. Cancio – CERN/IT


Outline

The ELFms framework

Quattor

Lemon

LEAF

Deployment status


Fabric Management with ELFms (I)

ELFms stands for ‘Extremely Large Fabric management system’

Subsystems:

• Quattor: configuration, installation and management of nodes

• Lemon: system / service monitoring

• LEAF: hardware / state management

ELFms manages and controls most of the nodes in the CERN CC

• ~2600 nodes out of ~3500
• Multiple functionality and cluster size (batch nodes, disk servers, tape servers, DB, web, …)
• Heterogeneous hardware (CPU, memory, HD size, …)
• Supported OS: Linux (RH7, RHES2/3/4, Scientific Linux 3/4, 32/64 bit) and Solaris

Diagram labels: Node Configuration Management, Node Management


Fabric Management with ELFms (II)

• ELFms (Quattor/Lemon) was started in the scope of EU DataGrid.

• Development is now coordinated by CERN/IT in collaboration with other HEP institutes.

• Quattor/Lemon are used in production inside and outside CERN

• LCG T1/T2 sites, ranging from 50 to 800 nodes/site

• Complete configuration of system and LCG Grid middleware via Quattor

• Integration with Grid services, e.g. monitoring (GridICE, MonALISA)


http://quattor.org


Quattor

Quattor takes care of the configuration, installation and management of fabric nodes

A Configuration Database holds the ‘desired state’ of all fabric elements

• Node setup (CPU, HD, memory, software RPMs/PKGs, network, system services, location, audit info…)

• Cluster (name and type, batch system, load balancing info…)

Autonomous management agents running on the node for:

• Base installation

• Service (re-)configuration

• Software installation and management
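
The idea of an agent converging a node towards the 'desired state' held in the configuration database can be illustrated with a small sketch. This is not Quattor or SPMA code: the function name `plan_package_changes`, the dictionary layout and the package names other than `lsf` are assumptions made purely for illustration.

```python
def plan_package_changes(desired, installed):
    """Return the operations needed to move `installed` to `desired`.

    Both arguments map package name -> version string (hypothetical layout).
    """
    ops = []
    for name, version in sorted(desired.items()):
        if name not in installed:
            ops.append(("install", name, version))
        elif installed[name] != version:
            # Covers upgrades and downgrades alike: the desired state wins.
            ops.append(("replace", name, version))
    for name in sorted(installed):
        if name not in desired:
            # Anything not listed in the desired state is scheduled for removal.
            ops.append(("remove", name, installed[name]))
    return ops


# Illustrative data only; "newtool" and "oldtool" are made-up package names.
desired = {"lsf": "5", "newtool": "1.0"}
installed = {"lsf": "4", "oldtool": "2.0"}

plan = plan_package_changes(desired, installed)
for op in plan:
    print(op)
```

Because the plan is derived only from the difference between the two states, the same function describes an upgrade, a downgrade or a first installation without special cases.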


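Architecture

[Diagram: Quattor architecture]

• Configuration server: CDB with SQL backend and XML backend; interfaces labelled CLI, GUI, scripts, SOAP and SQL; XML configuration profiles delivered over HTTP
• Install server: Install Manager and System installer; base OS delivered over HTTP / PXE
• SW server(s): SW Repository holding RPMs, delivered over HTTP
• Managed nodes: Node Configuration Manager (NCM) with components CompA, CompB, CompC for ServiceA, ServiceB, ServiceC; SW Package Manager (SPMA) handling RPMs / PKGs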


Configuration Information

Configuration is expressed using a language called Pan

• Information is arranged into templates
• Common properties set only once

Using templates it is possible to create hierarchies to match service structures

[Diagram: template hierarchy]

CERN CC
  name_srv1: 137.138.16.5
  time_srv1: ip-time-1

  lxbatch
    cluster_name: lxbatch
    master: lxmaster01
    pkg_add (lsf5.1)

  lxplus
    cluster_name: lxplus
    pkg_add (lsf5.1)

    lxplus001
      eth0/ip: 137.138.4.246
      pkg_add (lsf6_beta)

    lxplus020
      eth0/ip: 137.138.4.225

    lxplus029

  disk_srv
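
How "common properties set only once" plays out in a template hierarchy can be sketched in Python. This is not Pan syntax and not how the Pan compiler works internally; the `resolve` function, the `parent` / `set` / `pkg_add` keys and the choice to append package additions rather than replace them are all assumptions for illustration. The values are the ones shown on the slide.

```python
def resolve(templates, name):
    """Merge settings from the root of the hierarchy down to `name`."""
    chain = []
    while name is not None:
        chain.append(name)
        name = templates[name].get("parent")

    merged = {"pkgs": []}
    for tpl_name in reversed(chain):           # root first, leaf last
        tpl = templates[tpl_name]
        merged.update(tpl.get("set", {}))      # more specific values override
        merged["pkgs"] = merged["pkgs"] + tpl.get("pkg_add", [])
    return merged


# Hypothetical encoding of part of the hierarchy shown on the slide.
templates = {
    "CERNCC": {
        "parent": None,
        "set": {"name_srv1": "137.138.16.5", "time_srv1": "ip-time-1"},
    },
    "lxplus": {
        "parent": "CERNCC",
        "set": {"cluster_name": "lxplus"},
        "pkg_add": ["lsf5.1"],
    },
    "lxplus001": {
        "parent": "lxplus",
        "set": {"eth0/ip": "137.138.4.246"},
        "pkg_add": ["lsf6_beta"],
    },
}

profile = resolve(templates, "lxplus001")
print(profile)
```

The node-level template only states what is specific to `lxplus001`; the name server and cluster name arrive through the hierarchy.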


Quattor Deployment

• Quattor in complete control of Linux boxes (~2600 nodes, to grow to ~6-8000 in 2008)

• CDB holding information of all systems in CERN-CC

• Over 90 NCM configuration components developed
  • From basic system configuration to Grid services setup… (including desktops)

• SPMA used for managing all software
  • ~2 weekly security and functional updates (including kernel upgrades)
  • E.g. KDE security upgrade (~300 MB per node) and LSF client upgrade (v4 to v5) in 15 mins, without service interruption
  • Handles (occasional) downgrades as well

• Developments ongoing:
  • CDB: fine-grained ACL protection to templates, namespaces, stronger typing, improved SQL/XMLDB backend …
  • Security: deployment of HTTPS instead of HTTP (usage of host certificates)
  • Re-engineering of Software Repository (BARC)
  • Proxy architecture for enhanced scalability …


Proxy server setup

[Diagram: proxy hierarchy]

• Server cluster: backend (“Master”) M and M’, with frontend L1 proxies reached via DNS-load balanced HTTP
• L2 proxies (“Head” nodes) H serving Rack 1, Rack 2, … Rack N
• Content served: installation images, RPMs, configuration profiles
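
Why a two-level proxy hierarchy helps scalability can be shown with a toy model. The slide does not say that the proxies cache content, so caching is an assumption here, as are the class names `Backend` and `Proxy`, the path `profile.xml` and the number of nodes per rack. No real HTTP is involved.

```python
class Backend:
    """Stands in for the master server holding the content."""

    def __init__(self, files):
        self.files = files
        self.requests = 0

    def get(self, path):
        self.requests += 1
        return self.files[path]


class Proxy:
    """A caching proxy that falls back to its upstream on a miss."""

    def __init__(self, name, upstream):
        self.name = name
        self.upstream = upstream
        self.cache = {}
        self.hits = 0

    def get(self, path):
        if path in self.cache:
            self.hits += 1
            return self.cache[path]
        content = self.upstream.get(path)
        self.cache[path] = content
        return content


backend = Backend({"profile.xml": "<profile/>"})   # hypothetical content
l1 = Proxy("L1", backend)                          # frontend proxy
heads = [Proxy("rack1-head", l1), Proxy("rack2-head", l1)]

# Three nodes per rack fetch the same file through their rack's head node.
for head in heads:
    for _ in range(3):
        head.get("profile.xml")

print(backend.requests, l1.hits, [h.hits for h in heads])
```

In this model six node requests reach the backend only once; the remaining load is absorbed by the head nodes and the frontend.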


Quattor @ LCG/EGEE

• EGEE and LCG have chosen quattor for managing their integration testbeds

• Components available for a fully automated LCG-2 configuration

• Many sites (a dozen, including LAL/IN2P3, NIKHEF, DESY, …) adopt quattor as fabric management framework…
  • In India: BARC, VECCAL (ALICE experiment)

• … leading to improved core software robustness and completeness
  • Identified and removed site dependencies and assumptions
  • Documentation, installation guides, bug tracking, release cycles


http://cern.ch/lemon


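Lemon – LHC Era Monitoring

[Diagram: Lemon architecture]

• Nodes: Monitoring Agent with several Sensors, sending samples over TCP/UDP
• Monitoring Repository with repository backend (SQL)
• Correlation Engines, connected via SOAP
• User access: Lemon CLI (SOAP) and web browser on user workstations (HTTP, via apache with RRDTool / PHP)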


Deployment and Enhancements

• Smooth production running of Monitoring Agent and Oracle-based repository at CERN-CC
  • ~200 metrics sampled every 30 s -> 1 d; ~1 GB of data / day on ~1800 nodes
  • No aging-out of data but archiving on MSS (CASTOR)

• Usage outside CERN-CC, collaborations
  • GridICE (>100 LCG sites)
  • CMS-Online
  • IN2P3
  • Others…

• Hardened and enhanced EDG software
  • Rich sensor set (from general to service specific, e.g. IPMI/SMART for disk/tape …)
  • Generic multi-purpose sensor by BARC

• Correlation and Fault Recovery
  • Light-weight local self-healing module (e.g. /tmp cleanup, restart daemons)
  • Being re-engineered by BARC

• Security for sample transport (TCP and UDP) (BARC)

• Status and performance visualization pages …
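
The combination of per-metric sampling intervals and a local self-healing step can be sketched as follows. This is not the Lemon agent: the `Agent` class, its method names, the metric name `tmp_used_pct` and the 90% threshold are assumptions for illustration, and the clock is simulated so the example runs instantly.

```python
class Agent:
    """Toy monitoring agent: samples sensors and runs local recovery actions."""

    def __init__(self):
        self.sensors = {}     # metric name -> (read function, interval in seconds)
        self.last_run = {}    # metric name -> time of last sample
        self.recovery = []    # (metric name, predicate, action)

    def register(self, metric, read, interval):
        self.sensors[metric] = (read, interval)

    def on(self, metric, predicate, action):
        self.recovery.append((metric, predicate, action))

    def tick(self, now):
        """Sample every sensor whose interval has elapsed; return the samples."""
        samples = []
        for metric, (read, interval) in self.sensors.items():
            if now - self.last_run.get(metric, float("-inf")) >= interval:
                value = read()
                self.last_run[metric] = now
                samples.append((now, metric, value))
                for name, predicate, action in self.recovery:
                    if name == metric and predicate(value):
                        action()          # local self-healing, e.g. a cleanup
        return samples


# Simulated node state; a real sensor would inspect the file system.
state = {"tmp_used_pct": 95}

agent = Agent()
agent.register("tmp_used_pct", lambda: state["tmp_used_pct"], 30)
agent.on("tmp_used_pct", lambda v: v > 90, lambda: state.update(tmp_used_pct=10))

first = agent.tick(0)     # sampled; the recovery action fires
second = agent.tick(10)   # interval not yet elapsed, nothing sampled
third = agent.tick(30)    # sampled again, now showing the cleaned-up value
print(first, second, third)
```

The returned samples are what an agent would forward to the repository; the recovery action runs locally, without waiting for a central decision.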


Monitoring the Fabric

Using a web-based status display:

CC Overview

Clusters and nodes

VO’s

Power

Error trending

Batch system


Next Steps…

• Service based views (user/mgmt perspective)
  • Synoptical view of which services are running and how: appropriate for end users and managers
  • Needs to be built on top of Quattor and Lemon
  • Would require a separate service definition DB

• Alarm system for operators
  • Allow 24/24h 7/7d operators to receive, acknowledge, ignore, hide, process alarms received via Lemon
  • Integrated into the Lemon Status pages


http://cern.ch/leaf


LEAF - LHC Era Automated Fabric

LEAF is a collection of workflows for high level node hardware and state management, on top of Quattor and LEMON:

• HMS (Hardware Management System):
  • Track systems through all physical steps in lifecycle, e.g. installation, moves, vendor calls, retirement
  • Automatically requests installs, retires etc. to technicians
  • GUI to locate equipment physically
  • HMS implementation is CERN specific (based on Remedy workflows), but concepts and design should be generic

• SMS (State Management System):
  • Automated handling (and tracking of) high-level configuration steps
    • Reconfigure and reboot all cluster nodes for new kernel and/or physical move
    • Drain and reconfig nodes for diagnosis / repair operations
  • Issues all necessary (re)configuration commands via Quattor
  • Extensible framework: plug-ins for site-specific operations possible


Use Case: Move rack of machines

[Diagram: workflow between Node, HMS, NW DB, SMS, Quattor CDB, Service Mgr and Technicians]

1. New location
2. Set to standby
3. Update
4. Refresh
5. Take out of production
   • Close queues and drain jobs
   • Disable alarms
6. Request move
7a. Update
7b. Update
9. Install work order
10. Set to production
11. Update
12. Refresh
13. Put into production
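
The state changes in this use case can be sketched as a small simulation. Which component performs each step is not recoverable from the slide text, so the sketch makes its own assumptions: a desired state is written to a dictionary standing in for the configuration database, and a node acts on it when refreshed. The names `Node`, `move_rack`, `node-a`, `node-b` and `rack-B` are hypothetical.

```python
class Node:
    """Toy node that applies the desired state found in a config store."""

    def __init__(self, name):
        self.name = name
        self.state = "production"
        self.log = []

    def refresh(self, cdb):
        desired = cdb[self.name]
        if desired == self.state:
            return
        if desired == "standby":
            # Taking the node out of production, as listed in step 5.
            self.log += ["close queues and drain jobs", "disable alarms"]
        elif desired == "production":
            self.log += ["put into production"]
        self.state = desired


def move_rack(nodes, cdb, locations, new_location):
    """Set nodes to standby, record the move, then return them to production."""
    for node in nodes:
        cdb[node.name] = "standby"          # set to standby, update
        node.refresh(cdb)                   # refresh
    for node in nodes:
        locations[node.name] = new_location  # stands in for the physical move
    for node in nodes:
        cdb[node.name] = "production"       # set to production, update
        node.refresh(cdb)                   # refresh


nodes = [Node("node-a"), Node("node-b")]
cdb = {n.name: "production" for n in nodes}
locations = {n.name: "rack-A" for n in nodes}

move_rack(nodes, cdb, locations, "rack-B")
print(locations, nodes[0].log)
```

Keeping the desired state in one place and letting nodes react to it on refresh is what allows the same workflow to be reused for other operations, such as a kernel change.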


LEAF Deployment

• HMS in full production for all nodes in CC
  • HMS heavily used during CC node migration (~1500 nodes)

• SMS in production for all quattor managed nodes

• Current work:
  • More automation, and handling of other HW types for HMS
  • More service specific SMS clients (e.g. tape & disk servers)
  • Developing ‘asset management’ GUI (CCTracker) -> BARC
    • Multiple select, drag&drop nodes to automatically initiate HMS moves and SMS operations
    • Interface to LEMON GUI


Managing the Fabric

Visualize, locate and manage CC objects using high-level workflows

Visualize physical location of equipment

properties

Initiate and track workflows on hardware and services, e.g. add/remove/retire operations, update properties, kernel and OS upgrades, etc.


Summary

• ELFms is deployed in production at CERN
  • Stabilized results from 3-year developments within EDG and LCG
  • Established technology: from Prototype to Production
  • Consistent full-lifecycle management and high automation level
  • Providing real added-on value for day-to-day operations

• Quattor and LEMON are generic software
  • Other projects and sites getting involved

• Site-specific workflows and “glue scripts” can be put on top for smooth integration with existing fabric environments
  • LEAF HMS and SMS

ELFms = Quattor + Lemon + LEAF

More information: http://cern.ch/elfms