11
Partner Logo tp://cern.ch/hep-proj-grid-fabric DataGRID WP4 - Fabric Management Status report @ HEPiX 2002, Catania / IT, 18.04.02, Jan Iven Role and Architecture Status reports Installation Monitoring Others Short-term planning

Partner Logo DataGRID WP4 - Fabric Management Status report @ HEPiX 2002, Catania / IT, 18.04.02, Jan Iven Role and

Embed Size (px)

Citation preview

Page 1: Partner Logo  DataGRID WP4 - Fabric Management Status report @ HEPiX 2002, Catania / IT, 18.04.02, Jan Iven Role and

PartnerLogo

http://cern.ch/hep-proj-grid-fabric

DataGRID WP4 - Fabric Management

Status report @ HEPiX 2002, Catania / IT, 18.04.02, Jan Iven

Role and Architecture

Status reports

Installation

Monitoring

Others

Short-term planning

Page 2: Partner Logo  DataGRID WP4 - Fabric Management Status report @ HEPiX 2002, Catania / IT, 18.04.02, Jan Iven Role and

J.Iven - 2HEPiX 2002, Catania / IT

WP4 - Background information

WP4s objective: deliver the necessary tools to manage a computing fabric providing grid services on clusters scaling up to thousands of nodes.

Main scope: Fabric (system administration) management User job management (Grid and local)

Official participants: CERN (leading partner), INFN, NIKHEF, University of Heidelberg, ZIB (Berlin) and University of Edinburgh/ PPARC

Page 3: Partner Logo  DataGRID WP4 - Fabric Management Status report @ HEPiX 2002, Catania / IT, 18.04.02, Jan Iven Role and

J.Iven - 3HEPiX 2002, Catania / IT

FunctionalityEnterprise system administration - scalable to O(10K) nodes

Automated installation and maintenance of nodes Resource management (batch, interactive) Monitoring of events and performance Fault tolerance & recovery actions Fabric Configuration Management

Provision for running Grid jobs Authorization according to local policies Mapping Grid credential to local ones Publication of fabric resources and job information

Provision for running local jobs Sharing of resources according to local policies

Page 4: Partner Logo  DataGRID WP4 - Fabric Management Status report @ HEPiX 2002, Catania / IT, 18.04.02, Jan Iven Role and

J.Iven - 4HEPiX 2002, Catania / IT

DataGRID Architecture

Grid

Fabric

Grid

Collective Services

Information &

Monitoring

Replica Manager

Grid Scheduler

Local Application Local Database

Underlying Grid Services

Computing Element Services

Authorization Authentication and Accounting

Replica Catalog

Storage Element Services

SQL Database Services

Fabric services

ConfigurationManagement

Node Installation &Management

Monitoringand

Fault Tolerance

Resource Management

Fabric StorageManagement

Local Computing

Grid Application Layer

Data Management

Job Management

Metadata Management

Object to File Mapping

Service Index

WP4 tasks

Page 5: Partner Logo  DataGRID WP4 - Fabric Management Status report @ HEPiX 2002, Catania / IT, 18.04.02, Jan Iven Role and

J.Iven - 5HEPiX 2002, Catania / IT

Farm A (LSF) Farm B (PBS)

Grid User

(Mass storage,Disk pools)

Local User

Installation &Node Mgmt

ConfigurationManagement

Monitoring &Fault Tolerance

FabricGridification

ResourceManagement

Grid InfoServices(WP3)

WP4 subsystems

Other Wps

ResourceBroker(WP1)

Data Mgmt(WP2)

Grid DataStorage(WP5)

WP4 Architecture overview

- Interface between Grid-wide services and local fabric;

- Provides local authentication, authorization and mapping of grid credentials.

- provides transparent access to different cluster batch systems;

- enhanced capabilities (extended scheduling policies, advanced reservation, local accounting).

- provides a central storage and management of all fabric configuration information;

- central DB and set of protocols and APIs to store and retrieve information.

- provides the tools to install and manage all software running on the fabric nodes;

-Agent to install, upgrade, remove and configure software packages on the nodes.

-bootstrap services and software repositories;

- provides the tools for gathering monitoring information on fabric nodes;

-central measurement repository stores all monitoring information;

-- fault tolerance correlation engines detect failures and trigger recovery actions.

Page 6: Partner Logo  DataGRID WP4 - Fabric Management Status report @ HEPiX 2002, Catania / IT, 18.04.02, Jan Iven Role and

J.Iven - 6HEPiX 2002, Catania / IT

Status: Installation

Main focus: Linux on i386

Prototype available, based on a tool originally developed by Edinburgh University: LCFG (Local ConFiGuration system).

Main Features: automatic installation of O.S.

installation/upgrade/removal of software packages

configure and manage standard services and custom packages

Uses a central, hierarchical configuration database

Deployed on EDG testbed 1 in September 2001, interim releases every ~2 months. Currently ~70 nodes (CERN, INFN, NIKHEF, RAL + other UK, IFAE-Barcelona, LIP-Lisbon), +~30 being installed now (ESA-ESRIN, NIKJEF)

Page 7: Partner Logo  DataGRID WP4 - Fabric Management Status report @ HEPiX 2002, Catania / IT, 18.04.02, Jan Iven Role and

J.Iven - 7HEPiX 2002, Catania / IT

LCFG overview

A collection of agents read configuration parameters and either generate traditional config files or directly manipulate various services

Abstract configuration parameters for all nodes stored in a central repository

ldxprof

LoadProfile

Generic

Component

ProfileObject

rdxprof

ReadProfile

LCFG Objects

Local cache

Client nodes

Web Server

HTTP

XML Profile

LCFG Config Files

Make XMLProfile

Server

+inet.services telnet login ftp

+inet.allow telnet login ftp sshd

+inet.allow_telnet ALLOWED_NETWORKS

+inet.allow_login ALLOWED_NETWORKS

+inet.allow_ftp ALLOWED_NETWORKS

+inet.allow_sshd ALL

+inet.daemon_sshd yes

.....

+auth.users mickey

+auth.userhome_mickey /home/Mickey

+auth.usershell_mickey /bin/tcsh

Config files

<inet>

<allow cfg:template="allow_$ tag_$ daemon_$">

<allow_RECORD cfg:name="telnet">

<allow>192.168., 192.135.30.</allow>

</allow_RECORD>

.....

</auth>

<user_RECORD cfg:name="mickey">

<userhome>/home/Mickey</userhome>

<usershell>/bin/tcsh</usershell>

....

</user_RECORD>

XML profiles

ProfileObject

inet auth

/etc/services

/etc/inetd.conf

/etc/hosts.allow

in.telnetd : 192.168., 192.135.30.

in.rlogind : 192.168., 192.135.30.

in.ftpd : 192.168., 192.135.30.

sshd : ALL

/etc/shadow

/etc/group

/etc/passwd

....

mickey:x:999:20::/home/Mickey:/bin/tcsh

....

Slide prepared by INFN

Page 8: Partner Logo  DataGRID WP4 - Fabric Management Status report @ HEPiX 2002, Catania / IT, 18.04.02, Jan Iven Role and

J.Iven - 8HEPiX 2002, Catania / IT

Monitoring + Fault Tolerance

Principle: A Monitoring Agent running on each node samples the configured

metrics via sensors

The samples are sent to a central Monitoring Repository and stored. The samples are also stored locally to allow for local fault tolerance if appropriate

Correlation engines act on local or central data Trigger actions, e.g. Alarms or Recovery Create higher-level data

User Interface allows to query Repository, displays alarms

Status: Component APIs have been defined

First version of Agent available, deployed on EDG (~15) and CERN (~1000) nodes

Current prototype uses simple DB based on flat files

Page 9: Partner Logo  DataGRID WP4 - Fabric Management Status report @ HEPiX 2002, Catania / IT, 18.04.02, Jan Iven Role and

J.Iven - 9HEPiX 2002, Catania / IT

Other Statuses (Stati?) Fault Tolerance

Prototype which periodically checks the CPU/chip set temperatures as well as the fan speeds.

Configuration Management High Level Configuration Description Language: declarative way of describing

configuration of computer systems. First draft available.

High Level configuration Language to Low Level Configuration language Compiler. Alpha prototype available.

Central Configuration Database (CDB) (central store for all fabric configuration information). Being designed.

Resource Management Working on first prototype of the Resource Management Subsystem

Gridification Enhancing the Globus gatekeeper with plug-in authorization and credential

mapping components.

Page 10: Partner Logo  DataGRID WP4 - Fabric Management Status report @ HEPiX 2002, Catania / IT, 18.04.02, Jan Iven Role and

J.Iven - 10HEPiX 2002, Catania / IT

Planning up to Release 2 (09/02)

Installation: Split "Production" and "Research" Deliver Production quality LCFG in R2

Move to latest LCFG: support PXE installations, RedHat7.2

New Configuration Schema

Deploy new configuration language compiler

Monitoring: complete Prototype in R2 Prototype-quality agent , deploy everywhere

Simple alarm display / GUI

Rework transport layer (UDP ® reliable transport)

Integrate SNMP

Select and move to "real" database

Page 11: Partner Logo  DataGRID WP4 - Fabric Management Status report @ HEPiX 2002, Catania / IT, 18.04.02, Jan Iven Role and

J.Iven - 11HEPiX 2002, Catania / IT

Summary and LinksWP4 is well-established: software deployed and used

Release 2 will be Evolution, not Revolution

@CERN: close collaboration between WP4, farm admins and LCG team

DataGRID project: http://cern.ch/eu-datagrid

DataGRID WP4 : http://cern.ch/hep-proj-grid-fabric

LCFG: http://www.lcfg.org