Building, Monitoring and Maintaining a Grid
Jorge Luis Rodriguez, University of Florida
[email protected] 11-15, 2005
Jorge Luis Rodriguez 2Grid Summer Workshop 2005, July 11-15
Introduction
• What we’ve already learned
  – What are grids, why we want them and who is using them: GSW intro & L1…
  – Grid Authentication and Authorization: L2
  – Harnessing CPU cycles with Condor: L3
  – Data Management and the Grid: L4
• In this lecture
  – Fabric level infrastructure: Grid building blocks
  – The Open Science Grid
Grid Building Blocks
• Computational Clusters
• Storage Devices
• Networks
• Grid Resources and Layout:
  – User Interfaces
  – Computing Elements
  – Storage Elements
  – Monitoring Infrastructure…
Computer Clusters
[Figure: Dell cluster at UFlorida’s High Performance Center, annotated with its major components]
• A cluster management “frontend”
• Tape backup robots
• I/O servers, typically RAID fileservers, plus disk arrays
• The bulk are worker nodes
• A few head nodes, gatekeepers and other service nodes
A Typical Cluster Installation
[Figure: worker nodes, a head node/frontend server and an I/O node + storage, connected through a network switch with an uplink to the WAN]
• Cluster management
  – OS deployment
  – Configuration
  – Many options: ROCKS (kickstart), OSCAR (SystemImager), SysConfig
• Together the pieces supply computing cycles, data storage and connectivity
Networking
• Internal networks (LAN)
  – Private, accessible only to servers inside a facility
  – Some sites allow outbound connectivity via Network Address Translation
  – Typical technologies used:
    • Ethernet (0.1, 1 & 10 Gbps)
    • High-performance, low-latency interconnects
      – Myrinet: 2, 10 Gbps
      – InfiniBand: up to 120 Gbps
• External connectivity
  – Connection to the Wide Area Network
  – Typically achieved via the same switching fabric as the internal interconnects
[Figure: the cluster diagram again, its network switch uplinked to the WAN; Global Crossing (“one planet one network”) shown as the WAN provider]
The Wide Area Network
Ever-increasing network capacities are what make grid computing possible, if not inevitable.
[Figure: map of the Global Lambda Integrated Facility for Research and Education (GLIF)]
Computation on a Cluster
• Batch scheduling systems
  – Submit many jobs through a head node, e.g.:

    #!/bin/sh
    for i in $list_o_jobscripts
    do
        /usr/local/bin/condor_submit $i
    done

  – Execution is done on the worker nodes
• Many different batch systems are deployed on the grid
  – Condor (highlighted in lecture 3)
  – PBS, LSF, SGE…
• The batch system is the primary means of controlling CPU usage, enforcing allocation policies and scheduling jobs on the local computing infrastructure
[Figure: the cluster diagram again; jobs enter at the head node and execute on the worker nodes]
Storage Devices
Many hardware technologies are deployed, ranging from:
• A single fileserver
  – A Linux box with lots of disk: RAID 5…
  – Typically used for work space and temporary space
  – a.k.a. “tactical storage”
to:
• Large scale mass storage systems
  – Large peta-scale disk + tape robot systems
  – Ex: FNAL’s Enstore MSS
    • dCache disk frontend
    • Powderhorn tape backend
  – Typically used as permanent stores
  – a.k.a. “strategic storage”
[Figure: StorageTek Powderhorn tape silo]
Tactical Storage
• Typical hardware components
  – Servers: Linux, RAID controllers…
  – Disk arrays
    • IDE, SCSI, FC
    • RAID levels 5, 0, 50, 1…
• Local access
  – Volumes mounted across the compute cluster
    • NFS, GPFS, AFS…
  – Volume virtualization
    • dCache
    • PNFS
• Remote access (see the example below)
  – GridFTP: globus-url-copy
  – SRM interface
    • space reservation
    • request scheduling
[Figure: the cluster diagram again; the I/O node exports /tmp1 and /tmp2, mounted on every node as /share/DATA = nfs:/tmp1 and /share/TMP = nfs:/tmp2]
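Remote access deserves a concrete example. A minimal GridFTP copy with globus-url-copy, where the server name and both paths are made up for illustration:

    # Pull a file from a site's tactical storage over GridFTP
    # (se.example.edu and both paths are illustrative)
    globus-url-copy \
        gsiftp://se.example.edu/share/DATA/dataset001.root \
        file:///tmp/dataset001.root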
Layout of a Typical Grid Site
Computing fabric + grid middleware (delivered by the VDT, configured per OSG) + grid level services => a Grid site => The Grid
• Site elements: Compute Element, Storage Element, User Interface, Authz server, Monitoring Element
• Grid level services: monitoring clients & services, data management services, grid operations
Grid3 and the Open Science Grid
The Grid3 Project
Grid3: A Shared Grid Infrastructure for Scientific Research

Grid3 Applications

High Energy Physics (ATLAS, CMS, BTeV)
High energy physics applications from three experiments are currently running on Grid3. The Large Hadron Collider (LHC) experiments, A Toroidal LHC ApparatuS (ATLAS) and the Compact Muon Solenoid (CMS), are running Monte Carlo simulations for their respective data challenges and have successfully processed millions of events on Grid3. These events have contributed significantly to the world-wide production efforts in preparation for the Physics Technical Design Reports due in the coming year. The BTeV collaboration has deployed and run its Monte Carlo simulation application on Grid3. BTeV is a proposed B physics experiment at Fermi National Accelerator Laboratory.

ATLAS and CMS are also running analysis applications on Grid3. These applications operate in a significantly different mode than Monte Carlo production applications do. They typically sort through large data sets of either Monte Carlo simulation or detector-generated data in search of physics signals of interest. Examples of important topics are searches for the Higgs boson, supersymmetric particles and extra physical dimensions.

Astronomy & Astrophysics (SDSS)
The Sloan Digital Sky Survey (SDSS) is an ambitious project which plans to map in detail one-quarter of the entire sky, determining positions and absolute brightness of more than 100 million celestial objects. It will also measure the distances to more than a million galaxies and quasars.

On Grid3, SDSS has run the maxBcg cluster-finding program to measure the distances and masses of clusters of galaxies. Recently SDSS completed a run of the coadd application, a data-intensive application that combines visual information from sources in SDSS surveys. Its goal is to search for interesting large scale structures, such as gravitational arcs behind distant clusters of galaxies.

Gravitational Waves (LIGO)
The Laser Interferometer Gravitational-Wave Observatory (LIGO) is a facility dedicated to the detection of cosmic gravitational waves and the harnessing of these waves for scientific research.

On Grid3, LIGO is analyzing recent data in search of signals from the inspiral of massive compact objects like neutron stars or black holes. Other analyses are searching for the continuous signals from the gravitational-wave equivalent of pulsars.

[Figures: the CMS experiment being assembled; an artist’s rendition of gravitational waves emanating from a massive luminescent body; a collection of images from the SDSS data set]
The Grid3 grid
• A total of 35 sites
• Over 3500 CPUs
• Operations Center @ iGOC
• Began operations Oct. 2003
Grid3 Metrics
• CPU usage per VO, 03/04 through 09/04
  [Figure: CPU usage averaged over a day, 03/04-09/04; the data challenges for ATLAS (DC2) and CMS (DC04) dominate]
• Grid3 contribution to MC production for CMS
  – 41.4 x 10^6 simulated events, 11/03-3/05
  – USMOP = USCMS S&C + Grid3
The Open Science Grid
A consortium of universities and national laboratories building a sustainable grid infrastructure for science in the U.S.
…
Grid3 → Open Science Grid
• Begin with iterative extension to Grid3
  – Shared resources, benefiting a broad set of disciplines
  – Realization of the critical need for operations
  – More formal organization needed because of scale
• Build OSG from laboratories, universities, campus grids
  – Argonne, Fermilab, SLAC, Brookhaven, Berkeley Lab, Jefferson Lab
  – UW Madison, U Florida, Purdue, Chicago, Caltech, Harvard, etc.
• Further develop OSG
  – Partnerships and contributions from other sciences, universities
  – Incorporation of advanced networking
  – Focus on general services, operations, end-to-end performance
OSG Organization
[Figure: organization chart]
• OSG Council (Chair, officers from major stakeholders, PIs, faculty & lab managers)
• Executive Board (8-15 representatives, Chair, Officers)
• Core OSG Staff (a few FTEs, a manager)
• Advisory Committee
• Technical Groups and Activities (activity 1, activity 2, …)
• Stakeholders: research grid projects, VOs, researchers, sites, service providers, universities, labs, enterprise
OSG Activities and Technical Groups
A sampling of current OSG TGs and Activities:
• Technical Groups: TG-Policy, TG-Monitoring & Information, TG-Storage, TG-Support Centers
• Activities: OSG Integration, Provisioning, OSG deployment, Policy, Privilege, Docs, DRM, IPB, dCache, Operations, MIS (Monitoring & Information Systems)
OSG Technical Groups & Activities
• Technical Groups address and coordinate their given areas
  – Propose and carry out activities related to those areas
  – Liaise & collaborate with other peer projects (U.S. & international)
  – Participate in relevant standards organizations
  – Chairs participate in Blueprint, Grid Integration and Deployment activities
• Activities are well-defined, scoped sets of tasks contributing to the OSG
  – Each Activity has deliverables and a plan
  – … is self-organized and operated
  – … is overseen & sponsored by one or more Technical Groups
OSG Authentication & Authorization (“Authz”)
(the Privilege activity)
Authentication & Authorization
• Authentication: verify that you are who you say you are (see the proxy example below)
  – OSG users typically use the DOEGrids CA
  – OSG sites also accept CAs from LCG and other organizations
• Authorization: allow a particular user to use a particular resource
  – Legacy Grid3 method: gridmap-files
  – New Privilege method, employed primarily at US-LHC sites
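In day-to-day use, authentication boils down to creating a short-lived proxy certificate from your CA-issued credential; a minimal session:

    # Create a proxy certificate from your user certificate
    # (by default it is read from ~/.globus/)
    grid-proxy-init

    # Inspect the proxy: subject, issuer, remaining lifetime
    grid-proxy-info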
OSG Authentication: Grid3 Style
[Figure: VOMS servers at the iGOC (iVDGL, GADU…), LLab (LColab, Lexp1) and OLab (Oexp1, Aexp2…) publish user DNs; edg-mkgridmap clients at sites a, b, … n pull the DN mappings and write local gridmap-files]
• The gridmap-file maps a user’s grid credential (DN) to a local site group account
• VOMS = Virtual Organization Management System; DN = Distinguished Name; edg = European Data Grid (EU grid project)
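The gridmap-file that edg-mkgridmap writes is plain text: one quoted DN per line, mapped to a local account. The entries below are made up for illustration:

    # /etc/grid-security/grid-mapfile (DNs and accounts are illustrative)
    "/DC=org/DC=doegrids/OU=People/CN=Jane Doe 123456" uscms01
    "/DC=org/DC=doegrids/OU=People/CN=John Smith 654321" ivdgl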
The Privilege Project
Application of a Role Based Access Control model for OSG: an advanced authorization mechanism
The Privilege Project Provides
• Centralized, dynamic mapping of grid users to local OS qualifiers
  – VOMSes are still used as the DN database for grid identities
  – Gone, however, are the static grid-mapfiles
• An improved Access Control policy implementation
  – The GUMS service
  – Access rights granted based on the user’s
    • VO membership
    • user-selected role(s)
• Grid identity maps to Unix ID: the certificate DN to a local user, the VO role(s) to local group(s) (see the role-selection example below)
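Role selection happens on the client side when the proxy is created. A sketch with voms-proxy-init, where the VO name and role are placeholders:

    # Request a proxy carrying a VOMS attribute for a specific role
    # (the VO "uscms" and the role "prod" are illustrative)
    voms-proxy-init -voms uscms:/uscms/Role=prod

    # Verify which VO attributes the proxy carries
    voms-proxy-info -all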
Privilege Project Components
[Figure: message flow between a client server (UI), a VO service and an OSG site; the servers run VDT 1.3 based on GT3.2]
• The VO service hosts the VOMS server and the VOMS attribute repository, kept in sync with VO membership by the user management service (VOMSRS)
• The OSG site runs the gridFTP server & gatekeeper (with gridmap callout and PRIMA module), a web-service container hosting the GUMS identity mapping service (manages user accounts on resources, incl. dynamic allocation), and the job-manager
The flow:
1. VOMS-proxy-init request with a specified role (the client tool for role selection is VOMS-proxy-init)
2. The VOMS server retrieves the VO membership and role attribute
3. Standard globus-job-run request with the VOMS-extended proxy
4. HTTPS/SOAP request carrying a SAML query: may user “Markus Lorch” of “VO=USCMS / Role=prod” access this resource?
5. HTTPS/SOAP response carrying a SAML statement: Decision=Permit, with obligation local UID=xyz, GID=xyz
6. The gatekeeper instantiates the job-manager
Authorization in OSG
OSG will support multiple modes of operation.
• It will work with legacy client and/or server combinations
  – Full legacy server: VOMS + edg-mkgridmap
    • Privilege-enabled client requests (VOMS-proxy-init) will be mapped to a local user as previously: VOMS extensions are ignored
  – Full Privilege server: all Privilege components enabled
    • Legacy client requests are supported, but the user cannot be mapped to a different VO or assume a different role under their own VO
• A Grid3 compatibility mode will also be supported
  – Supports Privilege operations but without the Globus PRIMA callout, and thus cannot support role-based mapping
  – The gatekeeper/gridFTP server is operated with the legacy Grid3 stack (GTK 2.4 servers…)
  – Automatically provides reverse maps for Grid3/OSG accounting
Grid3 Compatibility Mode
[Figure: same layout as the Privilege components diagram, but the OSG site runs a legacy Grid3 gatekeeper or gridFTP server]
1. VOMS-proxy-init request (the client tool for role selection is VOMS-proxy-init)
2. The VOMS server retrieves the VO membership and role attribute
3. Standard globus-job-run request with the VOMS-extended proxy; the question “may user ‘Markus Lorch’ of ‘VO=USCMS / Role=prod’ access this resource?” is still asked, but the role is ignored
4. The gatekeeper instantiates the job-manager
On the site, a cron job (A) periodically queries the GUMS server, and the response (B) creates a static grid-mapfile: each DN is mapped to a local UID, with no role-based assignment. A reverse map is fed to OSG accounting.
OSG Grid Monitoring
(MIS: Monitoring & Information Systems)
OSG Monitoring & Information System
• Grid level clients
  – User interfaces & APIs
  – Graphical displays
    • GridCat
    • MonALISA
    • ACDC job monitor
    • Metrics DataViewer
    • BDII
• Site level infrastructure
  – Scripts & tools, collectors, databases, application APIs
    • MonALISA server (MonALISA client, Metrics DataViewer)
    • OSG MIS-CI (ACDC job monitor, GridCat)
    • MDS: Generic Information Provider (BDII)
Site Level Infrastructure
[Figure: collectors (stor_stat, Ganglia, GIP, job_state, others…) feed a monitoring information database and a historical information database; a consumer API serves the data out]
• Built on the MonALISA server, the MIS-Core Infrastructure and MDS
• Data is exposed through GRAM (jobman-mis), https/web services, GINI, SOAP, WSDL…
• Consumers include MonALISA, the Discovery Service, ACDC and GridCat
OSG Grid Level Clients
• The tools provide basic information about OSG resources
  – Resource catalog: the official tally of OSG sites
  – Resource discovery: what services are available, where they are and how to access them
  – Metrics information: usage of resources over time
• Used to assess scheduling priorities
  – Where and when should I send my jobs?
  – Where can I put my output?
• Used to monitor the health and status of the Grid
GridCat
http://osg-cat.grid.iu.edu
http://www.ivdgl.org/grid3/gridcat
MonALISA
OSG Provisioning
• OSG Software Cache
• OSG Meta Packager
• Grid Level Services
The OSG Software Cache
• Most software comes from the VDT
• OSG components include
  – VDT configuration scripts
  – Some OSG-specific packages too
• Pacman is the OSG meta-packager
  – This is how we deliver the entire cache to resources
What is the VDT?
• A collection of software
  – Grid software
  – Virtual data software
  – Utilities
• An easy installation mechanism
  – Goal: push a button, everything just works
  – Two methods:
    • Pacman: installs and configures it all
    • RPM: installs some of the software, but no configuration
• A support infrastructure
  – Coordinated bug fixing
  – Help desk
What is in the VDT?
• Condor Group: Condor/Condor-G, DAGMan, Fault Tolerant Shell, ClassAds, NeST
• Globus Alliance (3.2 pre-web): job submission (GRAM), information service (MDS), data transfer (GridFTP), Replica Location (RLS)
• EDG & LCG: Make Gridmap, certificate revocation list updater, Glue & Generic Information Provider, VOMS
• ISI & UC: Chimera & Pegasus
• NCSA: MyProxy, GSI OpenSSH, UberFTP
• LBL: PyGlobus, NetLogger, DRM
• Caltech: MonALISA, jClarens (WSR)
• VDT: VDT System Profiler, configuration software
• US LHC: GUMS, PRIMA
• Others: KX509 (U. Mich.), Java SDK (Sun), Apache HTTP/Tomcat, MySQL
• Optional packages: Globus-Core {build}, Globus job-manager(s)
Together these make up the core software for the User Interface, Computing Element, Storage Element, Authz system and Monitoring system.
Pacman
• Pacman is:
  – A software environment installer (or meta-packager)
  – A language for defining software environments
  – An interpreter that allows creation, installation, configuration, update, verification and repair of such environments
• Pacman makes installation of all types of software easy:

    % pacman -get iVDGL:Grid3

• It lets us easily and coherently combine and manage software from arbitrary sources (LCG/Scram, ATLAS/CMT, Globus/GPT, NorduGrid/RPM, LIGO/tar/make, D0/UPS-UPD, CMS DPE/tar/make, NPACI/TeraGrid/tar/make, OpenSource/tar/make, Commercial/tar/make)
• It lets remote experts define installation and config updating for everyone at once (caches: ATLAS, NPACI, D-Zero, iVDGL, UCHEP, VDT, CMS/DPE, LIGO)
Pacman Installation
1. Download Pacman
   – http://physics.bu.edu/~youssef/pacman/
2. Install the “package”
   – cd <install-directory>
   – pacman -get OSG:OSG_CE_0.2.1
   – ls
       condor/    globus/    post-install/    setup.sh
       edg/       gpt/       replica/         vdt/
       ftsh/      perl/      setup.csh        vdt-install.log
       monalisa/  ...
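The setup.sh and setup.csh files in the listing configure the environment for everything Pacman just installed. A typical follow-on in a Bourne-style shell, where the sanity check is just one way to confirm the install worked:

    cd <install-directory>
    source setup.sh          # puts the VDT tools on the PATH, sets GLOBUS_LOCATION, etc.
    which globus-url-copy    # sanity check: the VDT's Globus client should now resolve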
Grid Level Services
• OSG grid level monitoring infrastructure
  – Monitoring & Information System(s) top level database
  – Monitoring Authz: the mis user
• OSG operations infrastructure
  – Websites, Ops page, web servers…
  – Trouble ticket system
• OSG top level Replica Location Service
OSG Operations
Grid Operations: Monitoring and Maintaining the Health of the Grid
• Monitoring grid status
  – Use of grid monitors and verification routines
• Reporting, routing and tracking problems and their resolution
  – Trouble ticket system
• Repository of resource contact information
• User support, application support, VO issues
All of this is done as part of a nationally distributed system.
Operations Model in OSG
Ticket Routing in OSG
[Figure: ticket flow among the user, Support Centers (SCs) and Resource Providers (RPs), steps 1-12, spanning OSG infrastructure and SC private infrastructure]
1. A user in VO1 notices a problem at RP3 and notifies their SC.
2. SC-C opens a ticket and assigns it to SC-F.
3. SC-F gets automatic notice.
4. SC-F contacts RP3.
5. The admin at RP3 fixes the problem and replies to SC-F.
6. SC-F notes the resolution in the ticket.
7. SC-C gets automatic notice of the update to the ticket.
8. SC-C notifies the user of the resolution.
9. The user confirms the resolution.
10. SC-C closes the ticket.
11. SC-F gets automatic notice of the closure.
12. SC-F notifies RP3 of the closure.
OSG Integration and Development
Path for New Services in OSG
[Figure: flow from the OSG Integration Activity through deployment]
• OSG Integration Activity
  – VO application software installation and application validation
  – Middleware interoperability
  – Software & packaging
  – Functionality & scalability tests
  – Metrics & certification
  – Release description
• A release candidate plus a readiness plan (effort, resources) feed the OSG Deployment Activity, which adopts the readiness plan
• OSG Operations-Provisioning Activity: service deployment, with feedback into integration
OSG Integration Testbed Layout
[Figure: the OSG Integration Testbed sits between the integration release and the stable production release]
• Resources enter and leave as necessary
• Service platform: applications, test harness, clients
• VO contributed
OSG Integration Testbed
OSG Further Work
• Managed Storage
• Grid Scheduling
• More
Managing Storage
• Problems:
  – There is no real way to control the movement of files into and out of a site
    • Data is staged by fork processes!
    • Anyone with access to the site can submit such a request
  – There is also no space allocation control
    • A grid user can dump files of any size on a resource
    • If users do not clean up, the sysadmins have to intervene
  – These can easily overwhelm a resource
Managing Storage
• A solution: SRM (Storage Resource Manager), a grid-enabled interface for putting data on a site
  – Provides scheduling of data transfer requests
  – Provides reservation of storage space
• Technologies in the OSG pipeline (TG-Storage)
  – dCache/SRM (disk cache with SRM)
    • Provided by DESY & FNAL
    • SE(s) available to OSG as a service from the USCMS VO
  – DRM (Disk Resource Manager)
    • Provided by LBL
    • Can be added on top of a normal UNIX file system

    $> globus-url-copy srm://ufdcache.phys.ufl.edu/cms/foo.rfz \
           gsiftp://cit.caltech.edu/data/bar.rfz
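A dedicated SRM client can also speak to the interface directly. A sketch using FNAL's srmcp client, assuming it is installed; host, port and paths are illustrative:

    # Fetch a file through the SRM interface with the srmcp client
    srmcp srm://ufdcache.phys.ufl.edu:8443/cms/foo.rfz \
          file:////tmp/foo.rfz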
Grid Scheduling
The problem: with job submission this still happens!
[Figure: a user at a User Interface (VDT client) must pick among OSG Sites A, B, … X by hand: “Why do I have to do this by hand? @?>#^%$@#”]
Grid Scheduling
• Possible solutions
  – Sphinx (GriPhyN, UF)
    • Workflow-based dynamic planning (late binding)
    • Policy-based scheduling
    • For more details, ask Laukik
  – Pegasus (GriPhyN, ISI/UC)
    • DAGMan-based planner and grid scheduling (early binding)
    • More details in the workflow lecture (L6)
  – Resource Broker (LCG)
    • Matchmaker-based grid scheduling
    • Employed by applications running on LCG grid resources
Much, Much More Is Needed
• Continue the hardening of middleware and other software components
• Continue the process of federating with other grids (TG-Interoperability)
  – TeraGrid
  – LHC/EGEE
• Continue to synchronize the Monitoring and Information Service infrastructure
• Improve documentation
• …
Summary and Conclusions
Conclude with a Simple Example
1. Log on to a User Interface.
2. Get your grid proxy (“log on to the grid”): grid-proxy-init
3. Check the OSG MIS clients
   • To get the list of available sites: depends on your VO affiliation
   • To discover the site-specific information needed by your job, i.e.
     – Available services: hostnames, port numbers
     – Tactical storage locations: $app, $data, $tmp, $wntmp
4. Install your application binaries at selected sites.
5. Submit your jobs to selected sites via Condor-G (a submit file sketch follows).
6. Check the OSG MIS clients to see if jobs have completed.
7. Do something like this:

    if [ 0 ]; then   # substitute your own success test
        echo "Have a coffee (beer, margarita...)"
    else
        echo "it's going to be a long night"
    fi
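Step 5 relies on Condor-G. A minimal submit file sketch, where the gatekeeper host and jobmanager name are placeholders:

    # myjob.submit - a minimal Condor-G submit description
    # (gk.example.edu/jobmanager-condor is illustrative)
    universe      = grid
    grid_resource = gt2 gk.example.edu/jobmanager-condor
    executable    = myjob.sh
    output        = myjob.out
    error         = myjob.err
    log           = myjob.log
    queue

Submit it with condor_submit myjob.submit and poll with condor_q until the job leaves the queue (step 6).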
OSG Ribbon Cutting
OSG opens for business on July 20th, 2005
[Figure: the initial OSG map draws all of the grandfathered Grid3 sites and some of the OSG Integration Testbed sites]
Conclusion
Lots of progress since 1999, but a lot of work still remains!
http://www.opensciencegrid.org
http://www.griphyn.org
http://www.ivdgl.org
http://www.ppdg.org
The End
• Grid3 and the Open Science Grid
  – Grid3: A Shared Grid Infrastructure for Scientific Applications
    • Introduction
    • The Grid3 grid
    • Grid metrics
  – The Open Science Grid
    • The OSG Consortium
    • OSG organization
    • OSG Technical Groups and Activities
  – OSG Provisioning (Provisioning Activity)
    • Packaging
    • Software components: the VDT
    • Grid level services
  – OSG Grid Level Monitoring (MIS Technical Group)
    • Monitoring infrastructure
    • GridCat
    • MonALISA
    • MDS: Glue and the GIP
    • ACDC
  – OSG Operations (Support Centers Technical Group)
    • Operations model
    • Operations in Grid3
  – OSG Development and Integration (Integration Activity)
    • The OSG integration testbed
    • The OSG development cycle
  – OSG current state
    • The OSG grid
    • Schedule
• What is still missing
  – Grid scheduling
  – Managed storage
  – Refinement of the middleware stack…
• Other large scale grids
  – The LCG
  – TeraGrid
• Summary and conclusion