
Page 1: Building, Monitoring and Maintaining a Grid

Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez, University of Florida

[email protected]
July 11-15, 2005

Page 2: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 2Grid Summer Workshop 2005, July 11-15

Introduction

• What we’ve already learned
  – What are grids, why we want them and who is using them: GSW intro & L1…
  – Grid Authentication and Authorization: L2
  – Harnessing CPU cycles with Condor: L3
  – Data Management and the Grid: L4

• In this lecture
  – Fabric-level infrastructure: Grid building blocks
  – The Open Science Grid

Page 3: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 3Grid Summer Workshop 2005, July 11-15

• Computational Clusters

• Storage Devices

• Networks

• Grid Resources and Layout:
  – User Interfaces
  – Computing Elements
  – Storage Elements
  – Monitoring Infrastructure…

Grid Building Blocks

Page 4: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 4Grid Summer Workshop 2005, July 11-15

Computer Clusters

[Figure: Dell cluster at the University of Florida High Performance Center, annotated with its main components: a cluster-management "frontend" node; tape backup robots; I/O servers (typically RAID fileservers) and disk arrays; a few head nodes, gatekeepers and other service nodes; and the bulk of the machines serving as worker nodes.]

Page 5: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 5Grid Summer Workshop 2005, July 11-15

A Typical Cluster Installation

[Figure: worker nodes (e.g. Pentium III servers), a head node/frontend server and an I/O node with attached storage are connected through a network switch, which also provides the uplink to the WAN. Together they supply computing cycles, data storage and connectivity.]

Cluster Management
• OS deployment
• Configuration
• Many options: ROCKS (kickstart), OSCAR (SystemImager), Sysconfig…

Page 6: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 6Grid Summer Workshop 2005, July 11-15

Networking

• Internal Networks (LAN)
  – Private, accessible only to servers inside a facility
  – Some sites allow outbound connectivity via Network Address Translation
  – Typical technologies used
    • Ethernet (0.1, 1 & 10 Gbps)
    • High-performance, low-latency interconnects
      – Myrinet: 2, 10 Gbps
      – Infiniband: up to 120 Gbps

• External connectivity
  – Connection to the Wide Area Network
  – Typically achieved via the same switching fabric as the internal interconnects


Page 7: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 7Grid Summer Workshop 2005, July 11-15

The Wide Area Network

Ever-increasing network capacities are what make grid computing possible, if not inevitable.

The Global Lambda Integrated Facility for Research and Education (GLIF)

Page 8: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 8Grid Summer Workshop 2005, July 11-15

Computation on a Cluster

• Batch scheduling systems
  – Submit many jobs through a head node, for example:

      #!/bin/sh
      for i in $list_o_jobscripts
      do
        /usr/local/bin/condor_submit $i
      done

  – Execution done on worker nodes

• Many different batch systems are deployed on the grid
  – Condor (highlighted in lecture 3)
  – PBS, LSF, SGE…

The batch system is the primary means of controlling CPU usage, enforcing allocation policies and scheduling jobs on the local computing infrastructure.
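As a concrete illustration (not from the original slides), here is a minimal sketch of what one of the job scripts in the loop above might contain, written as a Condor submit description; the executable, file names and the vanilla universe are illustrative assumptions:

    # job_0001.sub -- hypothetical Condor submit description
    universe   = vanilla
    executable = /home/user/bin/analyze
    arguments  = run_0001.cfg
    output     = run_0001.out
    error      = run_0001.err
    log        = run_0001.log
    queue

    $ /usr/local/bin/condor_submit job_0001.sub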


Page 9: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 9Grid Summer Workshop 2005, July 11-15

Storage Devices

Many hardware technologies are deployed, ranging from:

Single fileserver
• Linux box with lots of disk: RAID 5…
• Typically used for work space and temporary space, a.k.a. "tactical storage"

to

Large Scale Mass Storage Systems
• Large petascale disk + tape robot systems
• Example: FNAL’s Enstore MSS
  – dCache disk frontend
  – Powderhorn tape backend
• Typically used as permanent stores, a.k.a. "strategic storage"

[Figure: StorageTek Powderhorn tape silo]

Page 10: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 10Grid Summer Workshop 2005, July 11-15

Tactical Storage

• Typical Hardware Components
  – Servers: Linux, RAID controllers…
  – Disk arrays
    • IDE, SCSI, FC
    • RAID levels 5, 0, 50, 1…

• Local Access
  – Volumes mounted across the compute cluster
    • nfs, gpfs, afs…
  – Volume virtualization
    • dCache
    • pnfs

• Remote Access (see the sketch below)
  – gridftp: globus-url-copy
  – SRM interface
    • space reservation
    • request scheduling
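For the remote-access path, a hedged example of moving a file off a site with gridftp via globus-url-copy; the hostname and paths below are invented for illustration:

    $ globus-url-copy file:///share/DATA/results.tar \
        gsiftp://se.somesite.edu/data/vo1/results.tar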


[Figure: the worker nodes and head node mount /share/DATA from nfs:/tmp1 and /share/TMP from nfs:/tmp2, exported from the I/O node’s local /tmp1 and /tmp2.]

Page 11: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 11Grid Summer Workshop 2005, July 11-15

Layout of Typical Grid Site

Computing Fabric + Grid Middleware (the VDT) + Grid Level Services (OSG) => A Grid Site

[Figure: a typical grid site exposes a Compute Element, a Storage Element, a User Interface, an Authz server and a Monitoring Element to the Grid, which in turn provides monitoring clients/services, data management services and grid operations.]

Page 12: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 12Grid Summer Workshop 2005 July 11-15

Grid3 and

Open Science Grid

Page 13: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 13Grid Summer Workshop 2005, July 11-15

The Grid3 Project

Grid3: A Shared Grid Infrastructure for Scientific Research

High Energy Physics

High Energy Physics applications from three experiments are currently running on Grid3. The Large Hadron Collider (LHC) experiments A Toroidal LHC ApparatuS (ATLAS) and Compact Muon Solenoid (CMS) are running Monte Carlo simulations for their respective data challenges and have successfully processed millions of events on Grid3. These events have contributed significantly to the worldwide production efforts in preparation for the Physics Technical Design Reports due in the coming year. The BTeV collaboration has deployed and run its Monte Carlo simulation application on Grid3; BTeV is a proposed B-physics experiment at Fermi National Accelerator Laboratory.

ATLAS and CMS are also running analysis applications on Grid3. These applications operate in a significantly different mode than Monte Carlo production applications do: they typically sort through large data sets of either Monte Carlo simulation or detector-generated data in search of physics signals of interest. Examples of important topics are searches for the Higgs boson, supersymmetric particles and extra physical dimensions.

Astronomy & Astrophysics

The Sloan Digital Sky Survey (SDSS) is an ambitious project which plans to map in detail one-quarter of the entire sky, determining positions and absolute brightnesses of more than 100 million celestial objects. It will also measure the distances to more than a million galaxies and quasars.

On Grid3, SDSS has run the maxBcg cluster-finding program to measure distances and masses of clusters of galaxies. Recently SDSS completed a run of the coadd application, a data-intensive application that combines visual information from sources in SDSS surveys. Its goal is to search for interesting large-scale structures, such as gravitational arcs behind distant clusters of galaxies.

The Laser Interferometer Gravitational-Wave Observatory (LIGO) is a facility dedicated to the detection of cosmic gravitational waves and the harnessing of these waves for scientific research.

On Grid3, LIGO is analyzing recent data in search of signals from the inspiral of massive compact objects like neutron stars or black holes. Other analyses are searching for the continuous signals from the gravitational-wave equivalent of pulsars.

[Figures: the CMS experiment being assembled; an artist’s rendition of gravitational waves emanating from a massive luminescent body; a collection of images from the SDSS data set.]


Grid3 Applications

Page 14: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 14Grid Summer Workshop 2005, July 11-15

The Grid3 grid

A total of 35 sites

Over 3500 CPUs

Operations Center @ iGOC

Began operations Oct. 2003


Page 15: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 15Grid Summer Workshop 2005, July 11-15

Grid3 Metrics

• CPU usage per VO, 03/04 through 09/04
• Data challenges for ATLAS and CMS

[Figures: CPU usage averaged over a day, 03/04-09/04; CMS DC04 and ATLAS DC2 simulation events, 41.4 x 10^6 events, 11/03-3/05]

• Grid3 contribution to MC production for CMS
• USMOP = USCMS S&C + Grid3

Page 16: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 16Grid Summer Workshop 2005, July 11-15

The Open Science Grid

A consortium of universities and national laboratories building a sustainable grid infrastructure for science in the U.S.

Page 17: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 17Grid Summer Workshop 2005, July 11-15

Grid3 → Open Science Grid

• Begin with iterative extension to Grid3
  – Shared resources, benefiting a broad set of disciplines
  – Realization of the critical need for operations
  – More formal organization needed because of scale

• Build OSG from laboratories, universities, campus grids
  – Argonne, Fermilab, SLAC, Brookhaven, Berkeley Lab, Jeff. Lab
  – UW Madison, U Florida, Purdue, Chicago, Caltech, Harvard, etc.

• Further develop OSG
  – Partnerships and contributions from other sciences, universities
  – Incorporation of advanced networking
  – Focus on general services, operations, end-to-end performance

Page 18: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 18Grid Summer Workshop 2005, July 11-15

OSG Organization

[Figure: the OSG organization. Stakeholders include the enterprise, research grid projects, VOs, researchers, sites, service providers, and universities and labs; Technical Groups coordinate the work, which is carried out in Activities. Governance consists of the OSG Council (chair and officers from major stakeholders, PIs, faculty and lab managers), an Executive Board (8-15 representatives, chair, officers), an Advisory Committee, and a small core OSG staff (a few FTEs plus a manager).]

Page 19: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 19Grid Summer Workshop 2005, July 11-15

OSG Activities and Technical Groups

[Figure: a sampling of current OSG Technical Groups and Activities. Technical Groups include TG-Policy, TG-Monitoring & Information, TG-Storage and TG-Support Centers. Activities include Integration, Provisioning, OSG deployment, Policy, Privilege, Docs, DRM, IPB, dCache, Operations and MIS (Monitoring & Information Systems).]

Page 20: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 20Grid Summer Workshop 2005, July 11-15

OSG Technical Groups & Activities

• Technical Groups address and coordinate their technical areas
  – Propose and carry out activities related to their given areas
  – Liaise & collaborate with other peer projects (U.S. & international)
  – Participate in relevant standards organizations
  – Chairs participate in Blueprint, Grid Integration and Deployment activities

• Activities are well-defined, scoped sets of tasks contributing to the OSG
  – Each Activity has deliverables and a plan
  – … is self-organized and operated
  – … is overseen & sponsored by one or more Technical Groups

Page 21: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 21Grid Summer Workshop 2005 July 11-15

OSG Authentication & Authorization

“Authz”

Privilege


Page 22: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 22Grid Summer Workshop 2005, July 11-15

Authentication & Authorization

• Authentication: verify that you are who you say you are
  – OSG users typically use the DOEGrids CA
  – OSG sites also accept CAs from LCG and other organizations

• Authorization: allow a particular user to use a particular resource
  – Legacy Grid3 method: gridmap-files
  – New Privilege method, employed primarily at US-LHC sites

Page 23: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 23Grid Summer Workshop 2005, July 11-15

OSG Authentication: Grid3 Style

[Figure: VOMS servers (e.g. at the iGOC, "LLab" and "OLab") publish the user DNs for their VOs (iVDGL, GADU, LColab, Lexp1, Oexp1, Aexp2, …). At each site, an edg-mkgridmap client periodically pulls these DN lists and writes a local gridmap-file, which maps a user’s grid credential (DN) to a local site group account.]

VOMS = Virtual Organization Management System
DN = Distinguished Name
edg = European Data Grid (EU grid project)
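The gridmap-file itself is a plain text file of DN-to-local-account lines; a short hypothetical sketch of what edg-mkgridmap might produce (the DNs and account names below are invented):

    "/DC=org/DC=doegrids/OU=People/CN=Jane Doe 12345" ivdgl01
    "/DC=org/DC=doegrids/OU=People/CN=John Smith 67890" uscms01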

Page 24: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 24Grid Summer Workshop 2005, July 11-15

The Privilege Project

Application of a Role-Based Access Control model for OSG

An advanced authorization mechanism

Page 25: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 25Grid Summer Workshop 2005, July 11-15

The Privilege Project Provides

• Centralized, dynamic mapping of grid users to local OS qualifiers
  – VOMS servers are still used as the database of grid identities (DNs)
  – Gone, however, are the static grid-mapfiles

• Improvement to Access Control policy implementation
  – The GUMS service
  – Access rights granted based on the user’s
    • VO membership
    • User-selected role(s)

Grid Identity → Unix ID: Certificate DN → local user, VO Role(s) → local group(s)

Page 26: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 26Grid Summer Workshop 2005, July 11-15

Privilege Project Components

[Figure: the Privilege Project components, running on servers with VDT 1.3 based on GT 3.2. A VO service hosts the VOMS server, the VOMS attribute repository and user management (VOMSRS). An OSG site runs the gridFTP server and gatekeeper with a gridmap callout and PRIMA module, a web-service container with the GUMS identity mapping service (which manages user accounts on resources, including dynamic allocation), and the job-manager. The client server (UI) provides voms-proxy-init as the tool for role selection. VO membership is synchronized between the VOMS attribute repository and GUMS.]

The authorization flow:
1. voms-proxy-init request with a specified role.
2. The VOMS server retrieves the VO membership and role attribute.
3. Standard globus-job-run request with the VOMS-extended proxy.
4. HTTPS/SOAP request, SAML query: may user "Markus Lorch" of "VO=USCMS / Role=prod" access this resource?
5. HTTPS/SOAP response, SAML statement: Decision=Permit, with obligation local UID=xyz, GID=xyz.
6. The gatekeeper instantiates the job-manager.
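From the user’s side, steps 1 and 3 of this flow look roughly like the following (a hedged sketch: the VO/role string and gatekeeper hostname are illustrative, and the exact option syntax varied between VOMS releases of that era):

    $ voms-proxy-init -voms uscms:/uscms/Role=prod     # step 1: request a proxy carrying the "prod" role
    $ globus-job-run gk.somesite.edu/jobmanager-condor /bin/hostname   # step 3: standard job request with the extended proxy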

Page 27: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 27Grid Summer Workshop 2005, July 11-15

Authorization in OSG

OSG will support multiple modes of operation:

• Will work with legacy client and/or server combinations
  – Full legacy server: VOMS + edg-mkgridmap
    • Privilege-enabled client requests (voms-proxy-init) will be mapped to a local user as previously: VOMS extensions are ignored
  – Full Privilege server: all Privilege components enabled
    • Legacy client requests are supported, but the user cannot be mapped to a different VO or assume a different role under its own VO

• A Grid3 compatibility mode will also be supported
  – Supports Privilege operations but without the globus PRIMA callout, and thus cannot support role-based mapping
  – The gatekeeper/gridFTP server is operated with the legacy Grid3 stack (Globus Toolkit 2.4 servers…)
  – Automatically provides reverse maps for Grid3/OSG accounting

Page 28: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 28Grid Summer Workshop 2005, July 11-15

Grid3 Compatibility Mode

[Figure: an OSG site with a legacy Grid3 gatekeeper or gridFTP server. The VO service (VOMS server, VOMS attribute repository, user management/VOMSRS) and the GUMS identity mapping service run on servers with VDT 1.3 based on GT 3.2; the gatekeeper and gridFTP server run the legacy Grid3 stack. A cron job on the site (A) periodically queries the GUMS server, asking whether users such as "Markus Lorch" of "VO=USCMS / Role=prod" may access the resource (the role, however, is ignored), and (B) the response is used to create a static grid-mapfile: DN mappings to local UIDs, but with no role-based assignment. A reverse map is also produced for OSG accounting.]

The job flow:
1. voms-proxy-init request from the client server (UI).
2. The VOMS server retrieves the VO membership and role attribute.
3. Standard globus-job-run request with the VOMS-extended proxy.
4. The gatekeeper instantiates the job-manager.

Page 29: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 29Grid Summer Workshop 2005 July 11-15

OSG Grid Monitoring

MIS: Monitoring & Information Systems

Page 30: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 30Grid Summer Workshop 2005, July 11-15

OSG Monitoring & Information System

• Grid Level Clients
  – User interfaces & APIs
  – Graphical displays
    • GridCat
    • MonALISA
    • ACDC job monitor
    • Metrics DataViewer
    • BDII

• Site Level Infrastructure
  – Scripts & tool collectors, databases, application APIs
    • MonALISA server
      – MonALISA client
      – Metrics DataViewer
    • OSG MIS-CI
      – ACDC job monitor
      – GridCat
    • MDS: Generic Information Provider
      – BDII

Page 31: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 31Grid Summer Workshop 2005, July 11-15

Site Level Infrastructure

[Figure: site-level monitoring infrastructure. Collectors (stor_stat, Ganglia, GIP, job_state, others…) feed a monitoring-information database and a historical-information database. The information is served through a consumer API over interfaces including GINI, SOAP, WSDL, GRAM (jobman-mis) and https web services to consumers such as MonALISA (and its discovery service), ACDC and GridCat.]

• MonALISA server
• MIS-Core Infrastructure
• MDS

Page 32: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 32Grid Summer Workshop 2005, July 11-15

OSG Grid Level Clients

• Tools provide basic information about OSG resources
  – Resource catalog: official tally of OSG sites
  – Resource discovery: what services are available, where they are and how to access them
  – Metrics information: usage of resources over time

• Used to assess scheduling priorities
  – Where and when should I send my jobs?
  – Where can I put my output?

• Used to monitor the health and status of the Grid

Page 33: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 33Grid Summer Workshop 2005, July 11-15

GridCat
http://osg-cat.grid.iu.edu

http://www.ivdgl.org/grid3/gridcat

Page 34: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 34Grid Summer Workshop 2005, July 11-15

MonALISA

Page 35: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 35Grid Summer Workshop 2005 July 11-15

OSG Provisioning

• OSG Software Cache
• OSG Meta-Packager
• Grid Level Services

Page 36: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 36Grid Summer Workshop 2005, July 11-15

The OSG Software Cache

• Most software comes from the VDT

• OSG components include
  – VDT configuration scripts
  – Some OSG-specific packages too

• Pacman is the OSG meta-packager
  – This is how we deliver the entire cache to resources

Page 37: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 37Grid Summer Workshop 2005, July 11-15

What is the VDT?

• A collection of software
  – Grid software
  – Virtual data software
  – Utilities

• An easy installation mechanism
  – Goal: push a button, everything just works
  – Two methods:
    • Pacman: installs and configures it all
    • RPM: installs some of the software, but no configuration

• A support infrastructure
  – Coordinated bug fixing
  – Help desk

Page 38: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 38Grid Summer Workshop 2005, July 11-15

What is in the VDT?

Condor Group
  – Condor / Condor-G
  – DAGMan
  – Fault Tolerant Shell
  – ClassAds
  – NeST

Globus Alliance (3.2 pre web)
  – Job submission (GRAM)
  – Information service (MDS)
  – Data transfer (GridFTP)
  – Replica Location (RLS)

EDG & LCG
  – Make Gridmap
  – Certificate Revocation List updater
  – Glue & Generic Information Provider
  – VOMS

ISI & UC
  – Chimera & Pegasus

NCSA
  – MyProxy
  – GSI OpenSSH
  – UberFTP

LBL
  – PyGlobus
  – Netlogger
  – DRM

Caltech
  – MonALISA
  – jClarens (WSR)

VDT
  – VDT System Profiler
  – Configuration software

US LHC
  – GUMS
  – PRIMA

Others
  – KX509 (U. Mich.)
  – Java SDK (Sun)
  – Apache HTTP/Tomcat
  – MySQL
  – Optional packages
  – Globus-Core {build}
  – Globus job-manager(s)

(The original slide also tags components by role: core software, User Interface, Computing Element, Storage Element, Authz System and Monitoring System.)

Page 39: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 39Grid Summer Workshop 2005, July 11-15

Pacman

• Pacman is:
  – A software environment installer (or meta-packager)
  – A language for defining software environments
  – An interpreter that allows creation, installation, configuration, update, verification and repair of such environments

• Pacman makes installation of all types of software easy

It can wrap software delivered in many native packaging formats, e.g. LCG/Scram, ATLAS/CMT, Globus/GPT, NorduGrid/RPM, LIGO/tar/make, D0/UPS-UPD, CMS DPE/tar/make, NPACI/TeraGrid/tar/make, OpenSource/tar/make, Commercial/tar/make.

% pacman -get iVDGL:Grid3

Enables us to easily and coherently combine and manage software from arbitrary sources (caches such as ATLAS, NPACI, D-Zero, iVDGL, UCHEP, VDT, CMS/DPE, LIGO), and enables remote experts to define installation and configuration updating for everyone at once.

Page 40: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 40Grid Summer Workshop 2005, July 11-15

Pacman Installation

1. Download Pacman
   – http://physics.bu.edu/~youssef/pacman/

2. Install the "package"
   – cd <install-directory>
   – pacman -get OSG:OSG_CE_0.2.1
   – ls

     condor/    globus/   post-install/   setup.sh
     edg/       gpt/      replica/        vdt/
     ftsh/      perl/     setup.csh       vdt-install.log
     monalisa/  ...
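After the pacman -get above finishes, the environment it created is typically picked up by sourcing the setup script written into the install directory (a hedged sketch; the install path is assumed):

    $ cd /opt/osg                 # assumed <install-directory>
    $ . ./setup.sh                # or "source setup.csh" for csh-family shells
    $ which globus-url-copy       # quick sanity check that the VDT tools are now on the PATH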

Page 41: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 41Grid Summer Workshop 2005, July 11-15

Grid Level Services

• OSG grid-level monitoring infrastructure
  – Monitoring & Information System(s) top-level database
  – Monitoring authz: the mis user

• OSG operations infrastructure
  – Websites, Ops page, web servers…
  – Trouble ticket system

• OSG top-level Replica Location Service

Page 42: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 42Grid Summer Workshop 2005 July 11-15

OSG Operations

Operations

Operations

Page 43: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 43Grid Summer Workshop 2005, July 11-15

Grid Operations

• Monitoring Grid status
  – Use of Grid monitors and verification routines
• Report, route and track problems and their resolution
  – Trouble ticket system
• Repository of resource contact information
• User support, application support, VO issues

All of this is done as part of a nationally distributed system: monitoring and maintaining the health of the Grid.

Page 44: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 44Grid Summer Workshop 2005, July 11-15

Operations Model in OSG

Page 45: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 45Grid Summer Workshop 2005, July 11-15

Ticket Routing in OSG

[Figure: ticket routing between a user’s Support Center (SC-C), the facility’s Support Center (SC-F) and the resource provider (RP3); some steps travel over OSG infrastructure and some over SC private infrastructure.]

1. A user in VO1 notices a problem at RP3 and notifies their SC.
2. SC-C opens a ticket and assigns it to SC-F.
3. SC-F gets an automatic notice
4. and contacts RP3.
5. The admin at RP3 fixes the problem and replies to SC-F.
6. SC-F notes the resolution in the ticket.
7. SC-C gets automatic notice of the update to the ticket.
8. SC-C notifies the user of the resolution.
9. The user confirms the resolution.
10. SC-C closes the ticket.
11. SC-F gets automatic notice of the closure.
12. SC-F notifies RP3 of the closure.

Page 46: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 46Grid Summer Workshop 2005 July 11-15

OSG Integration and Development

Integration

Page 47: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 47Grid Summer Workshop 2005, July 11-15

Path for New Services in OSG

[Figure: the path for new services in OSG, spanning the OSG Integration Activity, the OSG Deployment Activity and the OSG Operations-Provisioning Activity. Elements of the flow include: VO application, application validation, software installation, metrics & certification, release description, software & packaging, middleware interoperability, functionality & scalability tests, readiness plan, readiness plan adopted, release candidate, service deployment, effort, resources and feedback.]

Page 48: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 48Grid Summer Workshop 2005, July 11-15

OSG Integration Testbed Layout

[Figure: the OSG Integration Testbed sits between the integration release and the stable production release. Resources enter and leave as necessary; VOs contribute the service platform, applications, test harness and clients.]

Page 49: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 49Grid Summer Workshop 2005, July 11-15

OSG Integration Testbed

Page 50: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 50Grid Summer Workshop 2005 July 11-15

OSG Further Work

• Managed Storage

• Grid Scheduling

• More

Page 51: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 51Grid Summer Workshop 2005, July 11-15

Managing Storage

• Problems:
  – No real way to control the movement of files into and out of a site
    • Data is staged by fork processes!
    • Anyone with access to the site can submit such a request
  – There is also no space allocation control
    • A grid user can dump files of any size on a resource
    • If users do not clean up, sysadmins have to intervene

These problems can easily overwhelm a resource.

Page 52: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 52Grid Summer Workshop 2005, July 11-15

Managing Storage

• A Solution: SRM (Storage Resource Manager)
  – A grid-enabled interface to put data on a site
  – Provides scheduling of data transfer requests
  – Provides reservation of storage space

• Technologies in the OSG pipeline
  – dCache/SRM (disk cache with SRM)
    • Provided by DESY & FNAL
    • SE(s) available to OSG as a service from the USCMS VO
  – DRM (Disk Resource Manager)
    • Provided by LBL
    • Can be added on top of a normal UNIX file system

$> globus-url-copy srm://ufdcache.phys.ufl.edu/cms/foo.rfz \
     gsiftp://cit.caltech.edu/data/bar.rfz

Page 53: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 53Grid Summer Workshop 2005, July 11-15

Grid Scheduling

The problem: with job submission this still happens!

[Figure: a user at a User Interface (VDT client) must pick by hand among OSG Site A, OSG Site B, … OSG Site X: "Why do I have to do this by hand?"]

Page 54: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 54Grid Summer Workshop 2005, July 11-15

Grid Scheduling

• Possible Solutions
  – Sphinx (GriPhyN, UF)
    • Workflow-based dynamic planning (late binding)
    • Policy-based scheduling
    • For more details, ask Laukik
  – Pegasus (GriPhyN, ISI/UC)
    • DAGMan-based planner and Grid scheduling (early binding)
    • More details in the Work Flow lecture (lecture 6)
  – Resource Broker (LCG)
    • Matchmaker-based Grid scheduling
    • Employed by applications running on LCG Grid resources

Page 55: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 55Grid Summer Workshop 2005, July 11-15

Much Much More is Needed

• Continue the hardening of middleware and other software components

• Continue the process of federating with other Grids (TG-Interoperability)
  – TeraGrid
  – LHC/EGEE
• Continue to synchronize the Monitoring and Information Service infrastructure
• Improve documentation
• …

Page 56: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 56Grid Summer Workshop 2005 July 11-15

Summary and

Conclusions

Page 57: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 57Grid Summer Workshop 2005, July 11-15

Conclude with a simple example

1. Log on to a User Interface.
2. Get your grid proxy ("log on to the grid"): grid-proxy-init
3. Check the OSG MIS clients
   • To get the list of available sites (depends on your VO affiliation)
   • To discover site-specific information needed by your job, i.e.
     • Available services: hostname, port numbers
     • Tactical storage locations: $app, $data, $tmp, $wntmp
4. Install your application binaries at the selected sites.
5. Submit your jobs to the selected sites via Condor-G.
6. Check the OSG MIS clients to see if the jobs have completed.
7. Do something like this:

   if [ 0 ]; then
     echo "Have a coffee (beer, margarita...)"
   else
     echo "it's going to be a long night"
   fi
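Steps 2 and 5 might look roughly like the following Condor-G session (a hedged sketch: the gatekeeper hostname, executable and file names are invented, and the "universe = globus" syntax reflects the Condor-G of that era):

    $ grid-proxy-init                      # step 2: "log on to the grid"
    $ cat myjob.sub                        # hypothetical Condor-G submit description
    universe        = globus
    globusscheduler = gk.somesite.edu/jobmanager-condor
    executable      = myapp.sh
    output          = myjob.out
    error           = myjob.err
    log             = myjob.log
    queue
    $ condor_submit myjob.sub              # step 5: submit via Condor-G
    $ condor_q                             # check locally whether the jobs have finished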

Page 58: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 58Grid Summer Workshop 2005, July 11-15

OSG Ribbon Cutting

OSG opens for business on July 20th, 2005.

[Figure: the initial OSG draws on the OSG Integration Testbed and grandfathered Grid3 sites (labeled "All" and "Some" in the diagram).]

Page 59: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 59Grid Summer Workshop 2005 July 11-15

Conclusion

Lots of progress since 1999

But a lot of work still remains!

http://www.opensciencegrid.org
http://www.griphyn.org
http://www.ivdgl.org
http://www.ppdg.org

Page 60: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 60Grid Summer Workshop 2005 July 11-15

The End

Page 61: Building, Monitoring and Maintaining a Grid

Jorge Luis Rodriguez 61Grid Summer Workshop 2005, July 11-15

• Grid3 and the Open Science Grid
  – Grid3: A Shared Grid Infrastructure for Scientific Applications
    • Introduction
    • The Grid3 Grid
    • Grid Metrics
  – The Open Science Grid
    • The OSG Consortium
    • OSG organization
    • OSG Technical Groups and Activities
  – OSG Provisioning (Provisioning Activity)
    • Packaging
    • Software Components: The VDT
    • Grid level services
  – OSG Grid Level Monitoring (MIS Technical Group)
    • Monitoring infrastructure
    • GridCat
    • MonALISA
    • MDS: Glue and the GIP
    • ACDC
  – OSG Operations (Support Centers Technical Group)
    • Operations Model
    • Operations in Grid3
  – OSG Development and Integration (Integration Activity)
    • The OSG integration testbed
    • The OSG development cycle
  – OSG Current State
    • OSG Grid
    • Schedule

• What is still missing
  – Grid scheduling
  – Managed storage
  – Refinement of the middleware stack…

• Other Large Scale Grids
  – The LCG
  – TeraGrid

• Summary and Conclusion