© Cluster Resources, Inc. 1
Cluster Resources Training
Outline
1. Moab Overview
2. Deployment
3. Diagnostics and Troubleshooting
4. Integration
5. Scheduling Behaviour
6. Resource Access
7. Grid Computing
8. Accounting
9. Transitioning from LCRM
10. End Users
1. Moab Introduction
• Overview of the Modern Cluster
• Cluster Evolution
• Cluster Productivity Losses
• Moab Workload Manager Architecture
• What Moab Does
• What Moab Does Not Do
Cluster Stack / Framework:
[diagram] From bottom to top: Hardware (Cluster or SMP), Operating System, Resource Manager, Message Passing, Serial/Parallel Applications, Cluster Workload Manager (Scheduler, Policy Manager, Integration Platform), and Grid Workload Manager (Scheduler, Policy Manager, Integration Platform). Admins and Users access the stack through Portal, CLI, GUI, and Application interfaces, with Security spanning all layers.
Resource Manager (RM)
• While other systems may have stricter interpretations of a resource manager and its responsibilities, Moab's multi-resource-manager support allows a much more liberal interpretation.
– In essence, any object which can provide environmental information and environmental control can be utilized as a resource manager.
• Moab is able to aggregate information from multiple unrelated sources into a larger, more complete view of the cluster, which includes all the information and control found within a standard resource manager such as TORQUE, including:
– Node management
– Job management
– Queue management services
The Evolved Cluster
[diagram] Moab sits between admins/users and the cluster, coordinating a Resource Manager (and its Job Queue) over the Compute Nodes and Myrinet interconnect, alongside a License Manager, Identity Manager, and Allocation Manager; a Remote Site runs its own Moab and Resource Manager.
Moab Architecture
What Moab Does
• Optimizes Resource Utilization with Intelligent Scheduling and Advanced Reservations
• Unifies Cluster Management across Varied Resources and Services
• Dynamically Adjusts Workload to Enforce Policies and Service Level Agreements
• Automates Diagnosis and Failure Response
What Moab Does Not Do
• Does not do resource management (usually)
• Does not install the system (usually)
• Not a storage manager
• Not a license manager
• Does not do message passing
2. Deployment
• Installation
• Configuration
• Testing
Moab Workload Manager Installation
> tar -xzvf moab-4.5.0p0.linux.tar.gz
> cd moab-4.5.0
> ./configure
> make
• When you are ready to use Moab in production, install it into the configured installation directory by running make install.
• Workload Manager must be running before Cluster Manager and Access Portal will work.
• You only install Moab Workload Manager on the head node.
• You can choose to install client commands on a remote system as well.
File Locations
• $(MOABHOMEDIR)
– moab.cfg (general config file containing information required by both the Moab server and user interface clients)
– moab-private.cfg (config file containing private information required by the Moab server only)
– .moab.ck (Moab checkpoint file)
– .moab.pid (Moab 'lock' file to prevent multiple instances)
– log (directory for Moab log files - REQUIRED BY DEFAULT)
• moab.log (Moab log file)
• moab.log.1 (previous 'rolled' Moab log file)
– stats (directory for Moab statistics files - REQUIRED BY DEFAULT)
• Moab stats files (in format 'stats.<YYYY>_<MM>_<DD>')
• Moab fairshare data files (in format 'FS.<EPOCHTIME>')
– tools (directory for local tools called by Moab - OPTIONAL BY DEFAULT)
– traces (directory for Moab simulation trace files - REQUIRED FOR SIMULATIONS)
• resource.trace1 (sample resource trace file)
• workload.trace1 (sample workload trace file)
– spool (directory for temporary Moab files - REQUIRED FOR ADVANCED FEATURES)
– contrib (directory containing contributed code in the areas of GUI's, algorithms, policies, etc)
• $(MOABINSTDIR)
– bin (directory for installed Moab executables)
• moab (Moab scheduler executable)
• mclient (Moab user interface client executable)
• /etc/moab.cfg (optional file used to override the default $(MOABHOMEDIR) setting; it should contain the string 'MOABHOMEDIR $(DIRECTORY)' to override the built-in $(MOABHOMEDIR))
Initial Configuration – moab.cfg
• moab.cfg contains the parameters and settings for Moab Workload Manager. This is where you will set most of the policy settings.

Example of what moab.cfg will look like after installation:
##moab.cfg
SCHEDCFG[Moab] SERVER=test.icluster.org:4255
ADMINCFG[1] USERS=root
RMCFG[base] TYPE=PBS
Supported Platforms/Environments
• Resource Managers
– TORQUE, OpenPBS, PBSPro, LSF, LoadLeveler, SLURM, BProc, clubMASK, S3, WIKI
• Operating Systems
– RedHat, SUSE, Fedora, Debian, FreeBSD (+ all known variants of Linux), AIX, IRIX, HP-UX, OS X, OSF/Tru-64, SunOS, Solaris (+ all known variants of UNIX)
• Hardware
– Intel x86, Intel IA-32, Intel IA-64, AMD x86, AMD Opteron, SGI Altix, HP, IBM SP, IBM x-Series, IBM p-Series, IBM i-Series, Mac G4 and G5
Basic Parameters
• SCHEDCFG
– Specifies how the Moab server will execute and communicate with client requests.
– Example: SCHEDCFG[orion] SERVER=cw.psu.edu
• ADMINCFG
– Moab provides role-based security enabled by way of multiple levels of admin access.
– Example: The following may be used to enable users greg and thomas as level 1 admins:
• ADMINCFG[1] USERS=greg,thomas
– NOTE: Moab may only be launched by the primary admin user id.
• RMCFG
– In order for Moab to properly interact with a resource manager, the interface to this resource manager must be defined.
– Example: To interface to a TORQUE resource manager, the following may be used:
• RMCFG[torque1] TYPE=pbs
Scheduling Modes
• Simulation Mode
– Allows a test drive of the scheduler. You can evaluate how various policies might improve the current performance of a stable production system.
• Test Mode
– Allows evaluation of new Moab releases, configurations, and policies in a risk-free manner. In test mode, Moab behaves identically to live or normal mode except that it cannot start, cancel, or modify jobs.
• Normal Mode
– Live (set this way automatically after installation)
• Interactive Mode
– Like test mode, but instead of disabling all resource and job control functions, Moab sends the desired change request to the screen and asks for permission to complete it.
- Configure modes in moab.cfg
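The mode is selected with the MODE attribute of SCHEDCFG. A minimal sketch, reusing the server name and port from the installation example as placeholders:

```
# moab.cfg - run Moab in test mode (sketch; server/port are examples)
SCHEDCFG[Moab] MODE=TEST SERVER=test.icluster.org:4255
```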
Testing New Policies
• Verifying Correct Specification of New Policies
– If manually editing the moab.cfg file, use the mdiag -C command
– Moab Cluster Manager automatically verifies proper policy specification
• Verifying Correct Behavior of New Policies
– Put Moab in INTERACTIVE mode to confirm each change before it is made
• Determining Long Term Impact of New Policies
– Put Moab in SIMULATION mode
• Moab 'Side-by-Side'
– Allows a production cluster or other resource to be logically partitioned along resource and workload boundaries, with different instances of Moab scheduling different partitions.
• Use parameters: IGNORENODES, IGNORECLASSES, IGNOREUSERS
##moab.cfg for production partition
SCHEDCFG[prod]  MODE=NORMAL SERVER=orion.cxz.com:42020
RMCFG[TORQUE]   TYPE=PBS
IGNORENODES     node61,node62,node63,node64
IGNOREUSERS     gridtest1,gridtest2

##moab.cfg for test partition
SCHEDCFG[test]  MODE=NORMAL SERVER=orion.cxz.com:42020
RMCFG[TORQUE]   TYPE=PBS
IGNORENODES     !node61,node62,node63,node64
IGNOREUSERS     !gridtest1,gridtest2
Simulation
• What is the impact of additional hardware on cluster utilization?
• What delays to key projects can be expected with the addition of new users?
• How will new prioritization weights alter cycle distribution among existing workload?
• What total loss of compute resources will result from introducing a maintenance downtime?
• Are the benefits of cycle stealing from non-dedicated desktop systems worth the effort?
• How much will anticipated grid workload delay the average wait time of local jobs?
Scheduling Iterations
• Update State Information
• Refresh Reservations
• Schedule Reserved Jobs
• Schedule Priority Jobs
• Backfill Jobs
• Update Statistics
• Handle User Requests
• Perform Next Scheduling Cycle
Job Flow
• Determine Basic Job Feasibility
• Prioritize Jobs
• Enforce Configured Throttling Policies
• Determine Resource Availability
• Allocate Resources to Job
• Launch Job
Commands Overview

Command     Description
checkjob    provide detailed status report for specified job
checknode   provide detailed status report for specified node
mcredctl    controls various aspects about the credential objects within Moab
mdiag       provide diagnostic reports for resources, workload, and scheduling
mjobctl     control and modify jobs
mnodectl    control and modify nodes
mrmctl      query and control resource managers
mrsvctl     control and modify reservations
mschedctl   modify scheduler state and behavior
mshow       displays various diagnostic messages about the system and job queues
msub        scheduler job submission
resetstats  reset scheduler statistics
showbf      show current resource availability
showq       show queued jobs
showres     show existing reservations
showstart   show estimates of when job can/will start
showstate   show current state of resources
showstats   show usage statistics
End User Commands

Command    Description
canceljob  cancel existing job
checkjob   display job state, resource requirements, environment, constraints, credentials, history, allocated resources, and resource utilization
showbf     show resource availability for jobs with specific resource requirements
showq      display detailed prioritized list of active and idle jobs
showstart  show estimated start time of idle jobs
showstats  show detailed usage statistics for users, groups, and accounts which the end user has access to
Scheduling Objects
• Moab functions by manipulating five primary, elementary objects:
– Jobs
– Nodes
– Reservations
– QoS structures
– Policies
Jobs
• Job information is provided to the Moab scheduler from a resource manager (such as LoadLeveler, PBS, Wiki, or LSF).
• Job attributes include:
– Ownership of the job
– Job state
– Amount and type of resources required by the job
– Wallclock limit
• A job consists of one or more requirements, each of which requests a number of resources of a given type.
Nodes
• Within Moab, a node is a collection of resources with a particular set of associated attributes.
• A node is defined as one or more CPUs, together with associated memory, and possibly other compute resources such as local disk, swap, network adapters, and software licenses.
Advance Reservations
• An object which dedicates a block of specific resources for a particular use.
• Each reservation consists of a list of resources, an access control list, and a time range during which the access control list will be enforced.
• The reservation prevents the listed resources from being used in a way not described by the access control list during the specified time range.
Resource Managers
• Moab can be configured to manage more than one resource manager simultaneously, even resource managers of different types.
• Moab aggregates information from the RMs to fully manage workload, resources, and cluster policies
3. Troubleshooting and Diagnostics
• Object Messages
• Diagnostic Commands
• Admin Notification
• Logging
• Tracking System Failures
• Checkpointing
• Debuggers
http://www.clusterresources.com/products/mwm/moabdocs/14.0troubleshootingandsysmaintenance.shtml
Object Messages
• Messages can hold information regarding failures and key events
• Messages possess event time, owner, expiration time, and event count information
• Resource managers and peer services can attach messages to objects
• Admins can attach messages
• Multiple messages per object are supported
• Messages are persistent
http://www.clusterresources.com/products/mwm/moabdocs/commands/mschedctl.shtml
http://www.clusterresources.com/products/mwm/moabdocs/14.3messagebuffer.shtml
Diagnostics
• Moab's diagnostic commands present detailed state information:
– Diagnose scheduling problems
– Summarize performance
– Evaluate current operation, reporting on any unexpected or potentially erroneous conditions
– Where possible, correct detected problems if desired
mdiag
• Displays object state/health
• Displays object configuration– Attributes, resources, policies
• Displays object history and performance
• Displays object failures and messages
http://www.clusterresources.com/products/mwm/moabdocs/commands/mdiag.shtml
mdiag usage
• Most common diagnostics
– Scheduler (mdiag -S)
– Jobs (mdiag -j)
– Nodes (mdiag -n)
– Resource manager (mdiag -R)
– Blocked jobs (mdiag -b)
– Configuration (mdiag -C)
• Other diagnostics
– Fairshare, Priority
– Users, Accounts, Classes
– Reservations, QoS, etc.
http://www.clusterresources.com/products/mwm/moabdocs/commands/mdiag.shtml
mdiag details
• Performs numerous internal health and consistency checks
– Race conditions, object configuration inconsistencies, possible external failures
• Not just for failures
• Provides status, config, and current performance
• Enables Moab as an information service (--flags=xml)
Job Troubleshooting
To determine why a particular job will not start, several commands can be helpful:
• checkjob -v
– checkjob evaluates the ability of a job to start immediately. Tests include resource access, node state, and job constraints (e.g., startdate, taskspernode, QOS). Additional command line flags provide further information:
• -l <POLICYLEVEL>       // evaluate impact of throttling policies on job feasibility
• -n <NODENAME>          // evaluate resource access on specific node
• -r <RESERVATION_LIST>  // evaluate access to specified reservations
• checknode
– Display detailed status of a node
• mdiag -b
– Display various reasons a job is considered 'blocked' or 'non-queued'
• mdiag -j
– Display a high level summary of job attributes and perform a sanity check on job attributes/state
• showbf -v
– Determine general resource availability subject to specified constraints
Other Diagnostics
• checkjob and checknode commands
– Why a job cannot start
– Which nodes are available
– Information regarding recent events impacting the current job
– Node state
Issues with Client Commands
• Utilize built in moab logging – showq --loglevel=9
Or
• Check the moab log files
Logging Facilities
• Moab Log
– Report detailed scheduler actions, configuration, events, failures, etc
• Event Log– Report scheduler, job, node, and reservation events and failures
• Syslog– USESYSLOG
• http://www.clusterresources.com/products/mwm/moabdocs/a.fparameters.shtml#eventrecordlist
• http://www.clusterresources.com/products/mwm/moabdocs/14.2logging.shtml
• http://www.clusterresources.com/products/mwm/moabdocs/a.fparameters.shtml#usesyslog
# stats/events.Wed_Aug_24_2005
1124979598 rm    base    RMUP       initialized
1124979598 sched Moab    SCHEDSTART -
1124982013 node  node017 GEVENT     CPU2 Down
1124989457 node  node135 GEVENT     /var/tmp Full
1124996230 node  node139 GEVENT     /home Full
1125013524 node  node407 GEVENT     Transient Power Supply Failure
Logging Basics
• LOGDIR - Indicates directory for log files
• LOGFILE - Indicates path name of log file
• LOGFILEMAXSIZE - Indicates maximum size of log file before rolling
• LOGFILEROLLDEPTH - Indicates maximum number of log files to maintain
• LOGLEVEL - Indicates verbosity of logging
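Taken together, these logging parameters might be configured as follows; a sketch with illustrative values only:

```
# moab.cfg - logging sketch (values are examples, not recommendations)
LOGDIR           /var/spool/moab/log
LOGFILE          moab.log
LOGFILEMAXSIZE   10000000
LOGFILEROLLDEPTH 3
LOGLEVEL         3
```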
Function Level Information
• In source and debug releases, each subroutine is logged, along with all printable parameters.

##moab.log
MPolicyCheck(orion.322,2,Reason)
Status Information
• Information about internal status is logged at all LOGLEVELs. Critical internal status is indicated at low LOGLEVELs, while less critical and more verbose status information is logged at higher LOGLEVELs.

##moab.log
INFO: job orion.4228 rejected (max user jobs)
INFO: job fr4n01.923.0 rejected (maxjobperuser policy failure)
Scheduler Warnings
• Warnings are logged when the scheduler detects an
unexpected value or receives an unexpected result from a system call or subroutine.
##moab.log
WARNING: cannot open fairshare data file '/opt/moab/stats/FS.87000'
Scheduler Alerts
• Alerts are logged when the scheduler detects events
of an unexpected nature which may indicate problems in other systems or in objects.
##moab.log
ALERT: job orion.72 cannot run. deferring job for 360 Seconds
Scheduler Errors
• Errors are logged when the scheduler detects problems which impact its ability to properly schedule the cluster.
##moab.log
ERROR: cannot connect to Loadleveler API
Searching Moab Logs
• While major failures will be reported via the mdiag -S
command, these failures can also be uncovered by searching the logs using the grep command as in the following:
> grep -E "WARNING|ALERT|ERROR" moab.log
Event Logs
• Major events are reported to both the Moab log file as
well as the Moab event log. By default, the event log is maintained in the statistics directory and rolls on a daily basis, using the naming convention:– events.WWW_MMM_DD_YYYY (e.g. events.Fri_Aug_19_2005)
##event log format
<EPOCHTIME> <OBJECT> <OBJECTID> <EVENT> <DETAILS>
Enabling Syslog
• In addition to the log file, the Moab Scheduler can report events it determines to be critical to the UNIX syslog facility via the daemon facility using priorities ranging from INFO to ERROR.
• The verbosity of this logging is not affected by the LOGLEVEL parameter. In addition to errors and critical events, user commands that affect the state of the jobs, nodes, or the scheduler may also be logged to syslog.
• Moab syslog messages are reported using the INFO, NOTICE, and ERR syslog priorities.
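Syslog reporting is enabled with the USESYSLOG parameter referenced on the Logging Facilities slide; a minimal sketch:

```
# moab.cfg - report critical events to syslog (sketch)
USESYSLOG TRUE
```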
Tracking System Failures
• The scheduler has a number of dependencies which may cause failures if not satisfied.
• Disk Space
– The scheduler utilizes a number of files. If the file system is full or otherwise inaccessible, the following behaviors might be noted:
File Failure
moab.pid scheduler cannot perform single instance check
moab.ck* scheduler cannot store persistent record of reservations, jobs, policies, summary statistics, etc.
moab.cfg/moab.dat scheduler cannot load local configuration
log/* scheduler cannot log activities
stats/* scheduler cannot write job records
Checkpointing
• Moab checkpoints its internal state. The checkpoint file records statistics and attributes for jobs, nodes, reservations, users, groups, classes, and almost every other scheduling object.
• CHECKPOINTEXPIRATIONTIME - Indicates how long unmodified data should be kept after the associated object has disappeared (e.g., job priority for a job no longer detected).
– FORMAT - [[[DD:]HH:]MM:]SS
– EXAMPLE - CHECKPOINTEXPIRATIONTIME 1:00:00:00
• CHECKPOINTFILE - Indicates path name of checkpoint file
– FORMAT - <STRING>
– EXAMPLE - CHECKPOINTFILE /var/adm/moab/moab.ck
• CHECKPOINTINTERVAL - Indicates interval between subsequent checkpoints.
– FORMAT - [[[DD:]HH:]MM:]SS
– EXAMPLE - CHECKPOINTINTERVAL 00:15:00
moab.cfg:
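Combining the EXAMPLE values above into a single configuration fragment:

```
# moab.cfg - checkpointing sketch using the example values above
CHECKPOINTEXPIRATIONTIME 1:00:00:00
CHECKPOINTFILE           /var/adm/moab/moab.ck
CHECKPOINTINTERVAL       00:15:00
```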
4. Integration
• High Availability
• License Managers
• Identity Managers
• Allocation Managers
• Site Specific Integration (Native RM)
High Availability
• High Availability allows Moab to run on two different machines, a primary and a secondary server.
• While both are running, the secondary server, or fallback server, will continually update its internal statistics, reservations, and other information to stay synchronized with the primary server.
• Should the primary server stop running, the secondary will pick up all responsibilities of the primary server and begin to schedule jobs and track internal data.
• When the primary server comes back online, the secondary server will hand over its data and resume functionality as the secondary server.
http://clusterresources.com/moabdocs/22.2ha.shtml
High Availability Example
# moab.cfg on both servers
# (duplicate the master's moab.cfg, or share the same file using a shared file system)
SCHEDCFG[colony] SERVER=head1 FBSERVER=head2

# moab-private.cfg on head1 server
CLIENTCFG[colony] KEY=1dfv-fewv443v HOST=head2 AUTH=admin1

# moab-private.cfg on head2 server
CLIENTCFG[colony] KEY=1dfv-fewv443v HOST=head1 AUTH=admin1
http://clusterresources.com/moabdocs/22.2ha.shtml
Enabling High Availability Features
• Moab runs on two machines, a primary and a secondary server
– The secondary server, or fallback server, will continually update its internal statistics, reservations, and other information to stay synchronized with the primary server and take over scheduling should the primary server fail
Configuring High Availability
moab.cfg
SCHEDCFG[mycluster] SERVER=primaryhostname:3000
SCHEDCFG[mycluster] FBSERVER=secondaryhostname
• Both SERVER and FBSERVER use the format <HOST>[:<PORT>]. A few additional settings are needed for correct operation:
– Each server must specify a shared key using the CLIENTCFG parameter in the moab-private.cfg file.
– Each server must be properly configured as an administrator inside the resource manager, using the CLIENTCFG AUTH parameter.
– Each server must be able to communicate properly with the resource manager.
(See the TORQUE/PBS integration guide for a specific example.)
Confirming Configuration
• Run mdiag -R to confirm the fallback Moab is able to communicate with the primary Moab

node40:~/# mdiag -R
RM[rmnode30]  Type: PBS  State: Active
  ResourceType: COMPUTE
  Version: '1.2.0p6-snap.1122589577'
  Nodes Reported: 4
  Flags: executionServer,noTaskOrdering,typeIsExplicit
  Partition: rmnode30
  Event Management: EPORT=15004
  NOTE: SSS protocol enabled
  Submit Command: /usr/local/bin/qsub
  DefaultClass: batch
  RM Performance: AvgTime=0.01s MaxTime=1.03s (218 samples)

RM[internal]  Type: SSS  State: Active
  Version: 'SSS2.0'
  Flags: executionServer,localQueue,typeIsExplicit
  RM Performance: AvgTime=0.00s MaxTime=0.00s (125 samples)

NOTE: use 'mrmctl -f -r ' to clear stats/failures
Confirmation cont.
• Run mdiag -n to confirm the fallback Moab is able to communicate with the primary resource manager.

compute node summary
Name    State  Procs  Memory   Opsys
node31  Idle   1:1    27:27    Linux-2.6
node32  Idle   1:1    27:27    Linux-2.6
node33  Idle   1:1    27:27    Linux-2.6
node34  Idle   1:1    27:27    Linux-2.6
-----   ---    4:4    108:108  -----

Total Nodes: 4  (Active: 0  Idle: 4  Down: 0)
License Management
• Moab supports both node-locked and floating license models, and even allows mixing the two models simultaneously
• Methods for determining license availability:
– Local Consumable Resources
– Resource Manager Based Consumable Resources
– Interfacing to an External License Manager
• Requesting Licenses within Jobs

# qsub
> qsub -l nodes=2,software=blast cmdscript.txt
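For the node-locked model, license availability can be advertised as a per-node generic resource; a sketch assuming the NODECFG GRES attribute (license name, node, and count are examples only):

```
# moab.cfg - hypothetical node-locked license as a generic resource
NODECFG[node01] GRES=blast:2
```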
Identity Managers
• An identity manager is configured with the IDCFG parameter and allows Moab to exchange information with an external identity management service. As with Moab's resource manager interfaces, this service can be a full commercial package designed for this purpose, or something far simpler by which Moab obtains the needed information from a web service, text file, or database.

# moab.cfg
IDCFG[alloc] SERVER=exec://$TOOLSDIR/idquery.pl

# idquery.pl output
group:financial   fstarget=16.3 alist=project2
group:marketing   fstarget=2.5
group:engineering fstarget=36.7
group:dm          fstarget=42.5
Allocation Management
Allocation Management Overview
Gold Capabilities/Features
Allocation Manager Example
[chart: six successive allocations (Alloc1 through Alloc6) of 10,000 NH each, spanning Jan-99 through Jul-99]
Allocation Management
• Gold is an open source allocation system that controls project usage on High Performance Computers.
• What does it do?
– Pre-allocates resources to projects and users
– Controls project usage by billing for resource utilization
– Fine-grained control over who uses what, where, and when
– Tracks resource utilization
– Allows for insightful capacity planning
– Facilitates resource sharing between organizations (Grids)
• Who needs it?
– Sites with many projects
– Multi-cluster organizations
– Multi-organization grids
– QoS/SLA-based credit management
– Any credit/economic style environment
http://clusterresources.com/moabdocs/6.4allocationmanagement.shtml
Gold Features
• Enforces long-term usage limits
• Uses Reservations to Enforce Allocations
• Online Bank (Dynamic Charging)
• Journaled Account History
• Promotes Resource Sharing between Organizations (Grids)
• Facilitates capacity planning
[chart: projected usage against 100% capacity, from 3 quarters in the past to 4 quarters in the future]
http://www.emsl.pnl.gov/docs/mscf/gold
Allocation Manager Example
• Tightly integrated with Moab
• Can run in monitor mode only or can enforce allocations
# moab.cfg
AMCFG[bank] SERVER=gold://gold-server.some.org JOBFAILUREACTION=HOLD
AMCFG[bank] TIMEOUT=15
# moab-private.cfg
CLIENTCFG[AM:bank] KEY=mysecr3t AUTHTYPE=HMAC
http://clusterresources.com/moabdocs/6.4allocationmanagement.shtml
Other Allocation Management Options
• Qbank
• Moab’s native interface
Site Specific Integration with the Native Resource Manager Interface
• Everything you've ever wanted to do with Moab: an interface that allows sites to replace or augment their existing resource managers with information from sources such as:
– Arbitrary Scripts
– Ganglia
– FlexLM
– MySQL
http://clusterresources.com/moabdocs/13.5nativerm.shtml
http://clusterresources.com/moabdocs/13.7licensemanagement.shtml
[diagram: Moab exchanges Node/Job Info and Node/Job Modify requests with both PBS and a Native Resource Manager, which together provide Node Availability and Job Execution]
Native Resource Manager Example
# moab.cfg
# interface w/TORQUE
RMCFG[torque] TYPE=PBS

# interface w/FlexLM
RMCFG[flexLM] TYPE=NATIVE RTYPE=license
RMCFG[flexLM] CLUSTERQUERYURL=exec:///$HOME/tools/license.mon.flexlm.pl

# integrate local node health check script data
RMCFG[local] TYPE=NATIVE
RMCFG[local] CLUSTERQUERYURL=file:///opt/moab/localtools/healthcheck.dat
Utilizing Multiple Resource Managers
• Migrate jobs between resource managers
• Aggregate information into a cohesive node view

#moab.cfg
RESOURCELIST node01,node02...
RMCFG[base] TYPE=PBS
RMCFG[network] TYPE=NATIVE:AGFULL
RMCFG[network] CLUSTERQUERYURL=/tmp/network.sh
RMCFG[fs] TYPE=NATIVE:AGFULL
RMCFG[fs] CLUSTERQUERYURL=/tmp/fs.sh

#sample network script
_RX=`/sbin/ifconfig eth0 | grep "RX by" | cut -d: -f2 | cut -d' ' -f1`; \
_TX=`/sbin/ifconfig eth0 | grep "TX by" | cut -d: -f3 | cut -d' ' -f1`; \
echo `hostname` NETUSAGE=`echo "$_RX + $_TX" | bc`
5. Scheduling Behaviour
• Job Priority
• Fairshare
• Usage Limits
• Optimizing the Scheduler
Credentials
• Certain job attributes (such as user, group, account, class, and qos) describe entities the job belongs to and can be used to associate policies with jobs.
• Every job has credentials:
– Users (the only mandatory credential)
– Groups (standard Unix group or arbitrary collection of users)
– Accounts (associated with projects and billing)
– Class (associated with RM queues)
– Quality of Service (QoS) (policy overrides, resource access, service targets, charge rates)
• All credentials can have usage limits, fairshare targets, priorities, usage history, and credential access lists / defaults
http://clusterresources.com/moabdocs/3.5credoverview.shtml
Credential Membership
• Membership Examples

# moab.cfg
# user steve can access accounts a14, a7, a2, a6, and a1. If no account is
# explicitly requested, his job will be assigned to account a2
USERCFG[steve] ADEF=a2 ALIST=a14,a7,a2,a6,a1

# moab.cfg
# account omega3 can only be accessed by users johnh, stevek, jenp
ACCOUNTCFG[omega3] MEMBERULIST=johnh,stevek,jenp

# moab.cfg
# Controlling QoS access on a per-group basis
GROUPCFG[staff] QLIST=standard,special QDEF=standard
Fairness
• Definition:
– giving all users equal access to compute resources
– incorporating historical resource usage, political issues, and job value
• Moab provides a comprehensive and flexible set of tools for addressing the many and varied fairness management needs.
http://clusterresources.com/moabdocs/6.0managingfairness.shtml
Performance Metrics
• Metrics of Responsiveness
– Queue Time
• How long a job has been waiting
– X Factor
• Duration-weighted responsiveness factor
• Strongest single factor of perceived fairness
• Metrics of Utilization
– Throughput
• Jobs per unit time
– Utilization
• Percentage of cluster in use
[figure: XFactor = Y / X, where Y is measured from Submit Time and X from StartTime]
http://clusterresources.com/moabdocs/5.1.2priorityfactors.shtml
General Fairness Strategies
• Maximize Scheduler Options -- Do Not Overspecify
• Keep It Simple -- Do Not Address Hypothetical Issues
• Seek To Adjust User Behaviour, Not Limit User Options
• Allow Users to Specify Required Service Level
• Monitor Cluster Performance Regularly
• Tune Policies As Needed
Priority
• 2-tier prioritization structure
• Independent component and subcomponent weights/caps
• Components include service, target, fairshare, resource, usage, job attribute, and credential
• Negative priority jobs may be blocked
• Tuning facility available with mdiag -p
http://clusterresources.com/moabdocs/5.1jobprioritization.shtml
Job Prioritization – Component Overview
• Service
– Level of service delivered or anticipated
– Includes queue time, xfactor, bypass, policy violation
• Target
– Desired service level
– Provides exponential factor growth
– Includes target queue time, target xfactor
• Credential
– Based on credential priorities
– Includes user, group, account, QoS, and class
http://www.clusterresources.com/moabdocs/5.1.2priorityfactors.shtml#cred
http://www.clusterresources.com/moabdocs/5.1.2priorityfactors.shtml#service
http://www.clusterresources.com/moabdocs/5.1.2priorityfactors.shtml#target
1
© Cluster Resources, Inc. 76
Job Prioritization – Component Overview
• Fairshare
– Based on historical resource consumption
– Includes user, group, account, QoS, and class fairshare
– Includes current usage metrics of jobs per user, procs per user, and ps per user
– May allow prioritization with 'cap' fairshare target
• Resource
– Based on requested resources
– Includes nodes, processors, memory, swap, disk, and proc-equivalents
– Includes duration-based metrics of walltime and proc-seconds
http://www.clusterresources.com/moabdocs/5.1.2priorityfactors.shtml#fairshare
http://www.clusterresources.com/moabdocs/5.1.2priorityfactors.shtml#resource
1
© Cluster Resources, Inc. 77
Job Prioritization – Component Overview
• Job Attribute
– Allows prioritization based on current job state
– Allows prioritization based on job attributes (i.e., preemptible)
– Useful in preemption-based scheduling
• Usage
– Based on utilized resources
– Includes resources utilized, resources remaining, percent remaining
– Useful in preemption-based scheduling
http://www.clusterresources.com/moabdocs/5.1.2priorityfactors.shtml#usage
http://www.clusterresources.com/moabdocs/5.1.2priorityfactors.shtml#attr
1
© Cluster Resources, Inc. 78
mdiag -p
1
© Cluster Resources, Inc. 79
Sample Priority Usage
• A site wants to do the following:
– favor jobs in the low, medium, and high QOSs so they will run in QOS order
– balance job expansion factor
– use job queue time to prevent jobs from starving
• The sample moab.cfg is listed below:

# moab.cfg
QOSWEIGHT 1
QUEUETIMEWEIGHT 10

QOSCFG[low] PRIORITY=1000
QOSCFG[medium] PRIORITY=10000
QOSCFG[high] PRIORITY=100000
1
© Cluster Resources, Inc. 80
Credential Priority Example
# moab.cfg
# Service Priority Factors
SERVWEIGHT 1
XFACTORWEIGHT 10
QUEUETIMEWEIGHT 1000
# Credential Priority Factors
CREDWEIGHT 1
USERWEIGHT 1
CLASSWEIGHT 2
USERCFG[john] PRIORITY=200
CLASSCFG[batch] PRIORITY=15
CLASSCFG[debug] PRIORITY=100 XFWEIGHT=100
ACCOUNTCFG[bottomfeeder] PRIORITY=-5000 QTWEIGHT=1 XFWEIGHT=0
1
© Cluster Resources, Inc. 81
Priority Caps
#moab.cfg
XFACTORCAP 1000
QUEUETIMECAP 1000
QOSCAP 10000
• It is also possible to limit the priority contribution due to a particular priority factor
1
© Cluster Resources, Inc. 82
Manual Job Priority Adjustment
Sometimes you need to…
• Run an admin test job as soon as possible
• Pacify a disserviced user

Use the setspri command:
• setspri [-r] priority jobid

Example: setspri 1 PBS.1234.0
1
© Cluster Resources, Inc. 83
Usage Limits/Throttling
• Usage Limits
• Override Limits
• Idle Job Limits
• System Job Limits
• Hard and Soft Limits
http://clusterresources.com/moabdocs/6.2throttlingpolicies.shtml
1
© Cluster Resources, Inc. 84
Usage Limit Example
# moab.cfg
USERCFG[steve] MAXJOB=2 MAXNODE=30
GROUPCFG[staff] MAXJOB=5
CLASSCFG[DEFAULT] MAXNODE=16
CLASSCFG[batch] MAXNODE=32
# moab.cfg
# allow class batch to run up to 3 simultaneous jobs and
# allow any user to use up to 8 total nodes within class batch
CLASSCFG[batch] MAXJOB=3 MAXNODE[USER]=8

# allow users steve and bob to use up to 3 and 4 total processors
# respectively within class fast
CLASSCFG[fast] MAXPROC[USER:steve]=3 MAXPROC[USER:bob]=4
1
© Cluster Resources, Inc. 85
Override Limits
• Supersede the limits of other credentials, effectively causing all other limits of the same type (i.e., MAXJOB) to be ignored.
• Precede the limit specification with the capital letter 'O'.
# moab.cfg
USERCFG[steve] MAXJOB=2 MAXNODE=30
GROUPCFG[staff] MAXJOB=5
CLASSCFG[DEFAULT] MAXNODE=16
CLASSCFG[batch] MAXNODE=32
QOSCFG[hiprio] OMAXJOB=3 OMAXNODE=64
1
© Cluster Resources, Inc. 86
Idle Job Limits
• Limits the jobs that are currently eligible for scheduling
• Jobs that do not qualify as eligible do not accumulate priority in the queue
• Often used to prevent queue stuffing
# moab.cfg
USERCFG[steve] MAXIJOB=2 MAXINODE=30
GROUPCFG[staff] MAXIJOB=5
CLASSCFG[DEFAULT] MAXINODE=16
CLASSCFG[batch] MAXINODE=32
QOSCFG[hiprio] MAXIJOB=3 MAXINODE=64
1
© Cluster Resources, Inc. 87
System Job Limits

Limit             Parameter                  Description
duration SYSTEMMAXJOBWALLTIME limits the maximum requested wallclock time per job
processors SYSTEMMAXPROCPERJOB limits the maximum requested processors per job
processor-seconds SYSTEMMAXPROCSECONDPERJOB limits the maximum requested processor-seconds per job
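A moab.cfg fragment applying these parameters might look like the following; the parameter names come from the table above, but the specific values are illustrative only:

```
# moab.cfg
# no job may request more than 1 day of wallclock time
SYSTEMMAXJOBWALLTIME 1:00:00:00
# no job may request more than 256 processors
SYSTEMMAXPROCPERJOB 256
# no job may request more than 86400 processor-seconds
SYSTEMMAXPROCSECONDPERJOB 86400
```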
1
© Cluster Resources, Inc. 88
Hard and Soft Limits
• Balance both fairness and utilization
#moab.cfg
USERCFG[steve] MAXJOB=2,4 MAXNODE=15,30
GROUPCFG[staff] MAXJOB=2,5
CLASSCFG[DEFAULT] MAXNODE=16,32
CLASSCFG[batch] MAXNODE=12,32
QOSCFG[hiprio] MAXJOB=3,5 MAXNODE=32,64
1
© Cluster Resources, Inc. 89
Fairshare
• Fairshare scheduling helps steer a system toward usage targets by adjusting job priorities based on short term historical usage.
• Moab's fairshare can target usage percentages, ceilings, floors or caps for users, groups, accounts, classes, and QOS levels.
http://clusterresources.com/moabdocs/6.3fairshare.shtml
1
© Cluster Resources, Inc. 90
Fairshare Example
# moab.cfg
FSINTERVAL 12:00:00
FSDEPTH 4
FSDECAY 0.5
FSPOLICY DEDICATEDPS

# all users should have a fs target of 10%
USERCFG[DEFAULT] FSTARGET=10.0

# user john gets extra cycles
USERCFG[john] FSTARGET=20.0

# reduce staff priority if group usage exceeds 15%
GROUPCFG[staff] FSTARGET=15.0-

# give group orion additional priority if usage drops below 25.7%
GROUPCFG[orion] FSTARGET=25.7+

FSUSERWEIGHT 10
FSGROUPWEIGHT 100
http://clusterresources.com/moabdocs/6.3fairshare.shtml
[Figure: effective fairshare usage = per-interval consumption metric weighted by decay-factor weightings; Depth = number of intervals, running back from Now into the Past]
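The decay weighting can be sketched as follows; this assumes each historical interval's consumption metric is simply weighted by FSDECAY raised to the interval's age, which is an illustration rather than Moab's exact computation:

```python
# Effective fairshare usage: interval_usage[0] is the most recent FSINTERVAL,
# and len(interval_usage) plays the role of FSDEPTH.
def effective_usage(interval_usage, decay):
    return sum(u * decay ** age for age, u in enumerate(interval_usage))

# With FSDEPTH=4 and FSDECAY=0.5 (as in the example above):
print(effective_usage([100, 80, 60, 40], 0.5))  # 160.0
```

Older intervals contribute progressively less, so the metric tracks short-term historical usage as the slide describes.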
1
© Cluster Resources, Inc. 91
Fairshare stats
• Provide credential-based
usage distributions over time
• mdiag -f
• Maintained for all credentials
• Stored in stats/FS.${epochtime}
• Shows detailed time-distribution usage by fairshare metric
1
© Cluster Resources, Inc. 92
Optimization
Optimization is maximizing performance while fully addressing all mission objectives. True optimization includes aspects of policy selection, increased availability, user training, and other factors.

• Identifying Policy Bottlenecks
• Identifying Resource Fragmentation
• Preemption
• Malleable/Dynamic Jobs
• Backfill
1
© Cluster Resources, Inc. 93
Productivity Losses
Remaining Productivity
Environmental Losses
•File System Failures•Network Failures•Hardware Failures
Political Losses
•Underutilization of Resources Due to Overly Strict Political Access Constraints
Intra-job inefficiencies
•Poorly Designed Jobs•Poorly Functioning Jobs•Heterogeneous Resources Allocated
Partitioning Losses
•Underutilization of Resources Due to Physical Access Limits
Hardware Failures
•Job Loss and Delay due to Node, Network, and other Infrastructure Failures
Scheduling Inefficiencies
•Managing complex site policies•“Keeping Everybody Happy”•Scheduling jobs where they will finish faster rather than where they will start sooner
Middleware Failures/ Overhead
•Licensing•Network Applications•Resource Managers
Moab-based systems consistently achieve 90-99% utilization and objective-based resource delivery guarantees
1
© Cluster Resources, Inc. 94
Identifying Policy Bottlenecks
• Most Optimization is Enabled by Default
• Sources of Bottlenecks
– Usage Limits, Fairshare Caps
– Eval Steps
• Verify priority (are the most important jobs getting access to resources first?)
http://clusterresources.com/moabdocs/commands/mdiag-priority.shtml
1
© Cluster Resources, Inc. 95
Identifying Policy Bottlenecks (contd)
• Sources of Bottlenecks (contd)
– Eval Steps (contd)
• Check job blockage
• Adjust limits, caps, and priority as needed
• If needed, use simulation to determine the performance impact of changes
http://clusterresources.com/moabdocs/commands/mdiag-queues.shtml
1
© Cluster Resources, Inc. 96
Identifying Resource Fragmentation
• Fragmentation based on queues, reservations, partitions, OSs, architectures, etc.
• Recommended remedies: use node sets, soften reservations, use time-based reservations, etc.
• Use user training to eliminate user-specified fragmentation
1
© Cluster Resources, Inc. 97
Preemption
• Conflict between high utilization of the cluster and guarantees for important jobs
• Preemption allows the scheduler to 'retract' some scheduling decisions to address newly submitted workload
• QoS-based preemption allows the scheduler to enable preemption only if targets cannot be satisfied in other ways
http://www.clusterresources.com/products/mwm/docs/8.4preemption.shtml
1
© Cluster Resources, Inc. 98
Malleable/Dynamic Jobs
• Moab adjusts jobs to utilize available resources and fill holes
• Moab adjusts both job size and job duration
• Only supported with resource managers which support dynamic job modification (i.e. TORQUE) or with msub
http://www.clusterresources.com/products/mwm/docs/22.4dynamicjobs.shtml
1
© Cluster Resources, Inc. 99
Backfill
• Allows a scheduler to make better use of available resources by running jobs out of order
• Prioritizes the jobs in the queue according to a number of factors and then orders the jobs into a highest priority first (or priority FIFO) sorted list
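A minimal first-fit sketch of the idea (deliberately ignoring the reservations Moab uses to protect blocked high-priority jobs):

```python
# Walk the priority-sorted queue and start any job that fits in the idle
# processors, letting lower-priority jobs run out of order.
def backfill(queue, idle_procs):
    """queue: list of (job_name, procs) sorted highest priority first."""
    started = []
    for job, procs in queue:
        if procs <= idle_procs:
            idle_procs -= procs
            started.append(job)
    return started

# Job A (8 procs) cannot start on 6 idle procs, but B and C can:
print(backfill([("A", 8), ("B", 4), ("C", 2)], 6))  # ['B', 'C']
```

In a real scheduler the blocked job (A) keeps a priority reservation so backfilled jobs cannot delay its start indefinitely.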
1
© Cluster Resources, Inc. 100
6. Resource Access
• Admin Reservations
• Standing Reservations
• Nodesets
• Node Access Policies
• Partitions
1
© Cluster Resources, Inc. 101
Advance Reservations
An advance reservation is the mechanism by which Moab guarantees the availability of a set of resources at a particular time.

All reservations require three items:
• Resources
– Under Moab, the resources for a reservation are specified by way of a task description.
– Task: an atomic, or indivisible, collection of resources (processors, memory, swap, local disk, etc.)
• Timeframe
– a start time and an end time
• Access control list
– which jobs can use a reservation (users, groups, accounts, classes, QOS, and job duration)
1
© Cluster Resources, Inc. 102
Reservation Management Commands
Commands Flags Description
mdiag -r display summarized reservation information and any unexpected state
mrsvctl reservation control
mrsvctl -r remove reservations
mrsvctl -c create an administrative reservation
showres display information regarding location and state of reservations
1
© Cluster Resources, Inc. 103
Reservation MappingJob X, which meets access criteria for both reservation A and B, allocates a portion of its resources from each reservation and the remainder from resources outside of both reservations.
1
© Cluster Resources, Inc. 104
Advance Reservations
• Job Reservations
• Admin Reservations
– generally created to address non-periodic, 'one time' issues
• Standing Reservations
– provide a mechanism by which a site can dedicate a particular block of resources for a special use on a regular daily or weekly basis
• Personal User Reservations
– created by the end user
http://clusterresources.com/moabdocs/7.1.5managingreservations.shtml
http://clusterresources.com/moabdocs/7.1.6userreservations.shtml
1
© Cluster Resources, Inc. 105
Job Reservation Policies
• Reasons to Increase RESERVATIONDEPTH
– the estimated job start time information provided by the showstart command is heavily used and the accuracy needs to be increased
– priority dilution is preventing certain key mission objectives from being fulfilled
– users are more interested in knowing when their job will run than in having it run sooner
• Reasons to Decrease RESERVATIONDEPTH
– scheduling efficiency and job throughput need to be increased
##moab.cfg
RESERVATIONDEPTH[bigmem] 4 RESERVATIONQOSLIST[bigmem] special,fast,joshua
1
© Cluster Resources, Inc. 106
Administrative Reservations
• Created using: – mrsvctl -c (or setres) command
• Persistent until they expire or are removed using:– mrsvctl -r (or releaseres) command.
1
© Cluster Resources, Inc. 107
Administrative Reservations
• Examples– Created using ‘mrsvctl’
# reserve nodes node01 and node02 for administrative updates
> mrsvctl -c -h node01,node02 -d 1:00:00 -s +1:00:00

# reserve 6 tasks for project acme (only jobs in account acme can run in the reservation)
> mrsvctl -c -t 6 -a account==acme
1
© Cluster Resources, Inc. 108
Annotating Admin Reservations
• Label and annotate reservations using comments so that other admins, local users, portals, and other services can obtain more detailed information about the reservations.
– Use the '-n' and '-D' options of the mrsvctl command
> mrsvctl -c -D 'testing infiniband performance' -n nettest -h 'r:agt[15-245]'
mrsvctl -c example
1
© Cluster Resources, Inc. 109
Using Reservation Profiles
• Reservation profiles can be set up and utilized to prevent repetition of standard reservation attributes.
– Specify reservation names, descriptions, ACL's, durations, hostlists, triggers, flags, and other aspects which are commonly used.
RSVPROFILE[mtn1] TRIGGER=AType=exec,Action="/tmp/trigger1.sh",EType=start
RSVPROFILE[mtn1] USERLIST=steve,marym
RSVPROFILE[mtn1] HOSTEXP="r:50-250"
> mrsvctl -c -P mtn1 -s 12:00:00_10/03 -d 2:00:00
moab.cfg
mrsvctl -c
1
© Cluster Resources, Inc. 110
System Reservations
• Easy to reserve the entire cluster, or only sections of it:

> mrsvctl -c -t ALL
> mrsvctl -c -t ALL -s +1:00:00 -g staff
> mrsvctl -c -h node0[0-9][0-9] -d 24:00:00
> mrsvctl -c -h node[0-9][0-9][0-9] -T Action="/tmp/update.pl \
  $HOSTLIST",atype=exec,etype=start -s 23:50:00_6/15 -d 15:00

Can easily be scripted to roll out updates across the entire cluster at specific times, ensuring that no workload will be interrupted.
1
© Cluster Resources, Inc. 111
Optimizing Maintenance Reservations
• Configure the reservation to reduce its effective reservation shadow by allowing overlap with checkpointable/preemptible jobs up until the time the reservation becomes active.
– Modify the reservation to disable preemption access
– Preempt jobs which may overlap the reservation
– Cancel any jobs which failed to properly checkpoint and exit
## moab.cfg
RSVPROFILE[adm1] JOBATTRLIST=PREEMPTIBLE
RSVPROFILE[adm1] DESCRIPTION="regular system maintenance"
RSVPROFILE[adm1] TRIGGER=EType=start,Offset=-300,AType=modify,Action="acl-=jattr=PREEMPTIBLE"
RSVPROFILE[adm1] TRIGGER=EType=start,Offset=-240,AType=jobpreempt,Action="checkpoint"
RSVPROFILE[adm1] TRIGGER=EType=start,Offset=-60,AType=jobpreempt,Action="cancel"
> mrsvctl -c -P adm1 -s 12:00:00_10/03 -d 8:00:00 -h ALL
1
© Cluster Resources, Inc. 112
Standing Reservations
http://clusterresources.com/moabdocs/7.1.5managingreservations.shtml
• Standing Reservations provide a mechanism by which a site can dedicate a particular block of resources for a special use on a regular daily or weekly basis. For example, nodes 1-4 could be dedicated to running jobs only from users in the accounting group every Friday from 4 to 10 PM.
# moab.cfg
SRCFG[fast] PERIOD=DAY STARTTIME=16:00:00 ENDTIME=22:00:00
SRCFG[fast] HOSTLIST=node0[1-4]$
SRCFG[fast] GROUPLIST=accounting
1
© Cluster Resources, Inc. 113
Standing Reservation Example
# moab.cfg
SRCFG[shortpool] OWNER=ACCOUNT:jupiter
SRCFG[shortpool] FLAGS=SPACEFLEX
SRCFG[shortpool] MAXTIME=1:00:00
SRCFG[shortpool] TASKCOUNT=16
SRCFG[shortpool] STARTTIME=9:00:00
SRCFG[shortpool] ENDTIME=17:00:00
SRCFG[shortpool] DAYS=Mon,Tue,Wed,Thu,Fri
http://clusterresources.com/moabdocs/7.1.5managingreservations.shtml
The following reservation (known as a shortpool) only allows jobs to enter it that are guaranteed to complete within an hour. It floats around, incorporating nodes with jobs which will free up within the MAXTIME timeframe. This ensures there will be resources available for quick-turnaround work.
1
© Cluster Resources, Inc. 114
Rollback Reservations
• Specifies the minimum time in the future at which the reservation may start. This offset is rolling, meaning the start time of the reservation will continuously roll back into the future so as to maintain this offset.
• Rollback offsets are a good way of providing guaranteed resource access to users under the condition that they must commit their resources in the future or lose dedicated access.
http://clusterresources.com/moabdocs/7.1.5managingreservations.shtml
http://clusterresources.com/moabdocs/7.1.5managingreservations.shtml#ROLLBACKOFFSET
[Figure: rolling reservation maintaining a fixed time offset from Now; axes Nodes vs. Time, with jobs scheduled ahead of the reservation]
1
© Cluster Resources, Inc. 115
Rollback Reservations Example
• The standing reservation will guarantee access to up to 32 processors within 24 hours to jobs from the ajax account
# moab.cfg
SRCFG[ajax_rsv] ROLLBACKOFFSET=24:00:00 TASKCOUNT=32
SRCFG[ajax_rsv] PERIOD=INFINITY ACCOUNTLIST=ajax
1
© Cluster Resources, Inc. 116
End User Reservations
• Enabling Personal Reservation Management
– enabled on a per-QOS basis by setting the ENABLEUSERRSV flag as in the example below
– a non-admin user wishing to create a reservation must ALWAYS specify an accountable QOS with the mrsvctl -S flag

## moab.cfg
QOSCFG[titan] QFLAGS=ENABLEUSERRSV  # allow 'titan' QOS jobs to create user reservations
USERCFG[DEFAULT] QDEF=titan         # allow all users to access 'titan' QOS

> mrsvctl -c -S AQOS=titan -h node01 -d 1:00:00 -s 1:30:00
NOTE: reservation test.126 created
1
© Cluster Resources, Inc. 117
# moab.cfg
QOSCFG[rsv] QFLAGS=ENABLEUSERRSV  # allow 'rsv' QOS jobs to create user reservations
GROUPCFG[eng] QDEF=rsv            # allow all users in group eng to access 'rsv' QOS
Example: Allow all Users in Engineering Group to Create Personal Reservations
# moab.cfg
# special qos has higher job priority and ability to create user reservations
QOSCFG[special] QFLAGS=ENABLEUSERRSV
QOSCFG[special] PRIORITY=1000

# allow betty and steve to use the special qos
USERCFG[betty] QDEF=special
USERCFG[steve] QLIST=fast,special,basic QDEF=special
Example: Allow Specific Users to Create Personal Reservations
1
© Cluster Resources, Inc. 118
Reservation Limits

Limit             Description
RMAXDURATION limits the duration (in seconds) of any single personal reservation
RMAXPROC limits the size (in processors) of any single personal reservation
RMAXPS limits the size (in processor-seconds) of any single personal reservation
RMAXCOUNT limits the total number of personal reservations a credential may have active at any given moment
RMAXTOTALDURATION limits the total duration of personal reservations a credential may have active at any given moment
RMAXTOTALPROC limits the total number of processors a credential may have reserved at any given moment
RMAXTOTALPS limits the total number of processor-seconds a credential may have reserved at any given moment
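Assuming these limits are attached to a credential definition in moab.cfg in the same style as the other examples in this deck (placement and values here are illustrative, not prescriptive), a fragment might look like:

```
# moab.cfg
QOSCFG[titan] QFLAGS=ENABLEUSERRSV
# personal reservations under 'titan': at most 4 hours, 64 procs, 2 active
QOSCFG[titan] RMAXDURATION=4:00:00
QOSCFG[titan] RMAXPROC=64
QOSCFG[titan] RMAXCOUNT=2
```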
1
© Cluster Resources, Inc. 119
Node Sets• Allow a job to request a set of common resources without specifying
exactly what resources are required. • Node set policy can be specified globally or on a per-job basis and can
be based on node processor speed, memory, network interfaces, or locally defined node attributes.
• These policies may also be used to guide jobs to one or more types of nodes on which a particular job performs best
# moab.cfg
NODESETATTRIBUTE FEATURE
NODESETLIST switchA,switchB,switchC,switchD
NODESETPOLICY ONEOF
NODESETISOPTIONAL FALSE
CLASSCFG[amd] DEFAULT.NODESET=ONEOF:FEATURE:ATHLON,OPTERON
http://clusterresources.com/moabdocs/8.3nodesetoverview.shtml
1
© Cluster Resources, Inc. 120
Node Access Policy (SMP Issue)
• Shared vs Dedicated Node Access
– SHARED – Moab will allow tasks of other jobs to use the resources
– SINGLEJOB – one job only; multiple tasks possible
– SINGLETASK – one task only
– SINGLEUSER – allows multiple jobs from the same user
1
© Cluster Resources, Inc. 121
Node Allocation
• Allows a site to specify how available resources should be allocated to each job
– NODEALLOCATIONPOLICY
• Heterogeneous resources (resources which vary from node to node in terms of quantity or quality)
• Shared nodes (nodes may be utilized by more than one job)
• Reservations or service guarantees
• Non-flat network (a network in which a perceptible performance degradation may potentially exist depending on workload placement)
1
© Cluster Resources, Inc. 122
Resource Based Algorithms
• CPULOAD
• FIRSTAVAILABLE
• LASTAVAILABLE
• PRIORITY
• MINRESOURCE
• CONTIGUOUS
• MAXBALANCE
• FASTEST
• LOCAL
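As an illustration of how such a policy might behave (a hypothetical MINRESOURCE-style allocator, not Moab's implementation), smaller feasible nodes are consumed first so larger nodes stay free for jobs that need them:

```python
# nodes: {name: available_procs}; returns the chosen node names, or [] if
# the request cannot be satisfied.
def allocate_minresource(nodes, procs_needed):
    chosen, remaining = [], procs_needed
    # visit nodes smallest-first so resource-rich nodes are preserved
    for name, procs in sorted(nodes.items(), key=lambda kv: kv[1]):
        if remaining <= 0:
            break
        chosen.append(name)
        remaining -= procs
    return chosen if remaining <= 0 else []

print(allocate_minresource({"n1": 2, "n2": 4, "n3": 8}, 5))  # ['n1', 'n2']
print(allocate_minresource({"n1": 2}, 5))                    # []
```

Each listed algorithm is a different ordering/selection rule over the same candidate node list.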
1
© Cluster Resources, Inc. 123
Other Algorithms
• Time Based Algorithms
– Large backlog
– Large number of system or standing reservations
– Heavy use of backfill
• Locally Defined Algorithms
• Specifying Per Job Resource Preferences
1
© Cluster Resources, Inc. 124
Resource Provisioning
• Selects a resource to modify if resources are not available to meet the needs of the current requests
• Configure an interface to a provisioning manager– SystemImager– Xen
1
© Cluster Resources, Inc. 125
Partitions
• Divide resources along resource and political boundaries
• Avoid partitions when possible (use NodeSets or Reservations)
• Cannot span resource managers
• Jobs can span with the COALLOC flag
• Partition access can be managed on a credential basis
http://clusterresources.com/moabdocs/7.2partitions.shtml
1
© Cluster Resources, Inc. 126
Defining Partitions
• NODECFG
# moab.cfg
NODECFG[node001] PARTITION=astronomy
NODECFG[node002] PARTITION=astronomy
...
NODECFG[node049] PARTITION=math
RMCFG[base] TYPE=PBS
1
© Cluster Resources, Inc. 127
Managing Partition Access
• Use the *CFG parameter with PLIST and PDEF keywords– Ex. USERCFG, ACCOUNTCFG
# moab.cfg
SYSCFG[base] PLIST=
USERCFG[DEFAULT] PLIST=general
USERCFG[steve] PLIST=general:test PDEF=test
GROUPCFG[staff] PLIST=general:test PDEF=general
GROUPCFG[mgmt] PLIST=general:test PDEF=general
1
© Cluster Resources, Inc. 128
Partitions cont.
• Requesting Partitions
– Add -l partition=test to the qsub command line (for TORQUE)
– Select a partition in Moab Cluster Manager or Moab Access Portal
• Special jobs may be allowed to span the resources of multiple partitions if desired by associating the job with a QOS which has the flag 'COALLOC' set.
1
© Cluster Resources, Inc. 129
7. Peer to Peer (Grids)
• Cluster Stack / Framework
• Moab P2P Grid
• Peer Configuration
• Resource Control Overview
• Data Management
• Security
http://clusterresources.com/moabdocs/17.0peertopeer.shtml
1
© Cluster Resources, Inc. 130
Cluster Stack / Framework:
[Figure: layered stack: Portal, CLI, GUI, and Application interfaces for Admins and Users; Grid Workload Manager (Scheduler, Policy Manager, Integration Platform); Cluster Workload Manager (Scheduler, Policy Manager, Integration Platform); Resource Manager; Message Passing; Serial/Parallel Applications; Operating System; Hardware (Cluster or SMP); with Security spanning all layers]
1
© Cluster Resources, Inc. 131
Grid Types
A “Local Area Grid” uses one instance of Moab within an environment that shares a user and data space across multiple clusters, which may or may not have multiple hardware types, operating systems, and compute resource managers (e.g. LoadLeveler, TORQUE, LSF, PBS Pro, etc.)
[Figure: Local Area Grid (LAG): one Moab over Clusters A-C with shared user space and shared data space. Wide Area Grid (WAG): a master Moab over Clusters A-C with multiple user spaces and multiple data spaces]
A “Wide Area Grid” uses multiple Moab instances working together within an environment that can have multiple user and data spaces across multiple clusters, which may or may not have multiple hardware types, operating systems, and compute resource managers (e.g. LoadLeveler, TORQUE, LSF, PBS Pro, etc.). Wide Area Grid management rules can be centralized, locally controlled, or mixed.
Grid Management Scenarios
[Figure: three arrangements: Centralized Management (a Moab grid head node holding all grid rules above per-cluster Moabs for Clusters A-C); Centralized & Local Management (a Moab grid head node with shared grid rules above per-cluster Moabs, each with local grid rules); and Local Management, “Peer to Peer” (per-cluster Moabs, each with local grid rules, for Clusters A-C)]
1
© Cluster Resources, Inc. 132
Grid Benefits
• Scalability
• Resource Access
• Load-Balancing
• Single System Image (SSI)
• High Availability
1
© Cluster Resources, Inc. 133
Drawbacks of Layered Approach
• Stability
– Additional failure layer
– Centralized grid management (single point of failure)
• Optimization
– Limited local information and control
• Admin Experience
– Additional tool to learn/configure
– Policy duplication and conflicts
– Additional tool to manage/troubleshoot
• User Experience
– Additional submission language/environment
– Additional tool to track, manage workload
http://clusterresources.com/moabdocs/17.12p2pgrid.shtml
1
© Cluster Resources, Inc. 134
Moab P2P Approach
• Little to no user training
• Little to no admin training
• Single Policy set
• Transparent Grid
http://clusterresources.com/moabdocs/17.0peertopeer.shtml
1
© Cluster Resources, Inc. 135
Integrated Moab P2P/Grid Capabilities
• Distributed Resource Management
• Distributed Job Management
• Grid Information Management– Resource and Job Views
• Credential Management and Mapping
• Distributed Accounting
• Data Management
1
© Cluster Resources, Inc. 136
Grid Relationship Combinations
[Figure: a hosting site and Clusters A-H joined through Moab instances; a shared user/data space LAG, a grid head node with shared grid rules over per-cluster Moabs with local grid rules, and additional peer clusters, each with local grid rules]

Moab is able to facilitate virtually any grid relationship:
1. Join local area grids into wide area grids
2. Join wide area grids to other wide area grids (whether they be managed centrally, locally ("peer to peer"), or mixed)
3. Resource sharing can be in one direction for use with hosting centers, or to bill out resources to other sites
4. Have multiple levels of grid relationships (e.g. conglomerates within conglomerates within conglomerates)
1
© Cluster Resources, Inc. 137
Basic P2P Example
# moab.cfg for Cluster A
SCHEDCFG[ClusterA]
RMCFG[ClusterB] TYPE=MOAB SERVER=node03:41000
RMCFG[ClusterB.INBOUND] FLAGS=CLIENT CLIENT=ClusterB

# moab.cfg for Cluster B
SCHEDCFG[ClusterB]
RMCFG[ClusterA] TYPE=MOAB SERVER=node01:41000
RMCFG[ClusterA.INBOUND] FLAGS=CLIENT CLIENT=ClusterA

# moab-private.cfg for Cluster A
CLIENTCFG[RM:ClusterB] KEY=fet$wl02 AUTH=admin1

# moab-private.cfg for Cluster B
CLIENTCFG[RM:ClusterA] KEY=fet$wl02 AUTH=admin1
1
© Cluster Resources, Inc. 138
Peer Configuration
• Resource Reporting
• Credential Config
• Data Config
• Usage Limits
• Bi-Directional Job Flow
# moab.cfg (server 1)
SCHEDCFG[server1] SERVER=server1.omc.com:42005 MODE=NORMAL
RMCFG[server2-out] TYPE=MOAB SERVER=server2.omc.com:42005 CLIENT=server2
RMCFG[server2-in] FLAGS=client CLIENT=server2

## moab-private.cfg (server 1)
CLIENTCFG[server2] KEY=443db-writ4
1
© Cluster Resources, Inc. 139
Jobs
• Submitting Jobs to the Grid
– msub
– Uses the resource manager's submission language and translates it to msub
• Viewing Node and Job Information
– Each destination Moab server will report all compute nodes it finds back to the source Moab server
– Nodes show as local nodes, each within a partition associated with the resource manager reporting them
1
© Cluster Resources, Inc. 140
Resource Control Overview
• Full resource information– nodes appear with complete remote hostnames
and full attribute information • Remapped resource information
– nodes appear with remapped local hostnames and full attribute information
• Grid mode– information regarding nodes reported from a
remote peer is aggregated and transformed into one or more SMP-like large pseudo nodes
1
© Cluster Resources, Inc. 141
Controlling Resource Information
• Direct– nodes are reported to remote clusters exactly as they
appear in the local cluster
• Mapped– nodes are reported as individual nodes, but node names are
mapped to a unique name when imported into the remote cluster
• Grid– node information is aggregated into a single large SMP-like
pseudo-node before it is reported to the remote cluster
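The aggregation step can be sketched as follows; the record fields are hypothetical, but the idea matches the text: many remote node records collapse into one SMP-like pseudo-node:

```python
# Collapse a remote cluster's node records into a single pseudo-node report.
def aggregate(cluster_name, nodes):
    return {
        "name": f"{cluster_name}-pseudo",
        "procs": sum(n["procs"] for n in nodes),
        "memory": sum(n["memory"] for n in nodes),
    }

nodes = [{"procs": 4, "memory": 8192}, {"procs": 8, "memory": 16384}]
print(aggregate("ClusterB", nodes))
# {'name': 'ClusterB-pseudo', 'procs': 12, 'memory': 24576}
```

The trade-off is scalability versus fidelity: the peer sees far fewer objects, but loses per-node attribute detail.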
1
© Cluster Resources, Inc. 142
Grid Sandbox
• Constrains external resource access and limits which resources are reported to other peers
## moab.cfg
SRCFG[sandbox1] PERIOD=INFINITY HOSTLIST=node01,node02,node03
SRCFG[sandbox1] CLUSTERLIST=ALL FLAGS=ALLOWGRID
1
© Cluster Resources, Inc. 143
Access Controls
• Granting Access to Local Jobs
• Peer Access Control
## moab.cfg
SRCFG[sandbox2] PERIOD=INFINITY HOSTLIST=node04,node05,node06
SRCFG[sandbox2] FLAGS=ALLOWGRID QOSLIST=high GROUPLIST=engineer

## moab.cfg (Cluster 1)
SRCFG[sandbox1] PERIOD=INFINITY HOSTLIST=node01,node02,node03,node04,node05
SRCFG[sandbox1] FLAGS=ALLOWGRID CLUSTERLIST=ClusterB
SRCFG[sandbox2] PERIOD=INFINITY HOSTLIST=node6 FLAGS=ALLOWGRID
SRCFG[sandbox2] CLUSTERLIST=ClusterB,ClusterC,ClusterD USERLIST=ALL
1
© Cluster Resources, Inc. 144
Controlling Peer Workload Information
• Local workload exporting
– Helps simplify administration of different clusters by centralizing monitoring and management of jobs at one peer, and avoids forcing each peer to be of type SLAVE
##moab.cfg (ClusterB - Destination Peer) RMCFG[ClusterA] FLAGS=CLIENT,LOCALWORKLOADEXPORT # source peer
1
© Cluster Resources, Inc. 145
Data Management Configuration
• Global file systems
• Replicated data servers
• Need-based direct input
• Output data migration

## moab.cfg (NFS data server)
RMCFG[storage] TYPE=native SERVER=omc.omc13.com:42004 RTYPE=STORAGE
RMCFG[storage] SYSTEMMODIFYURL=exec://$HOME/tools/storage.ctl.nfs.pl
RMCFG[storage] SYSTEMQUERYURL=exec://$HOME/tools/storage.query.nfs.pl

## moab.cfg (SCP data server)
RMCFG[storage] TYPE=native SERVER=omc.omc13.com:42004 RTYPE=STORAGE
RMCFG[storage] SYSTEMMODIFYURL=exec://$HOME/tools/storage.ctl.scp.pl
RMCFG[storage] SYSTEMQUERYURL=exec://$HOME/tools/storage.query.scp.pl
1
© Cluster Resources, Inc. 146
Security
• Secret key based security is enabled via the moab-private.cfg file
• Globus Credential Based Server Authentication (4.2.4)
1
© Cluster Resources, Inc. 147
Credential Management
• Peer Credential Mapping
• Source and Destination Side Credential Mapping
## moab.cfg
SCHEDCFG[master1] MODE=normal
RMCFG[slave1] OMAP=file:///opt/moab/omap.dat

## /opt/moab/omap.dat (source object map file)
user:joe,jsmith
user:steve,sjohnson
group:test,staff
class:batch,serial
user:*,grid
1
© Cluster Resources, Inc. 148
• Preventing User Space Collisions

## moab.cfg
SCHEDCFG[master1] MODE=normal
RMCFG[slave1] OMAP=file:///opt/moab/omap.dat FLAGS=client

## /opt/moab/omap.dat (source object map file)
user:*,c1_*
group:*,*_grid
account:*,temp_*
• Interfacing with Globus GRAM
##moab.cfg SCHEDCFG[c1] SERVER=head.c1.hpc.org RMCFG[c2] SERVER=head.c2.hpc.org TYPE=moab JOBSTAGEMETHOD=globus
1
© Cluster Resources, Inc. 149
• Limiting Access To Peers
• Limiting Access From Peers
## moab.cfg
SCHEDCFG SERVER=c1.hpc.org
# only allow staff or members of the research and demo accounts to use
# remote resources on c2
RMCFG[c2] SERVER=head.c2.hpc.org TYPE=moab
RMCFG[c2] AUTHGLIST=staff AUTHALIST=research,demo

## moab.cfg
SCHEDCFG SERVER=c1.hpc.org FLAGS=client
# only allow jobs from remote cluster c1 with group credential staff or
# account research or demo to use local resources
RMCFG[c2] SERVER=head.c2.hpc.org TYPE=moab
RMCFG[c2] AUTHGLIST=staff AUTHALIST=research,demo
1
© Cluster Resources, Inc. 150
P2P Resource Affinity
• Certain compute architectures are able to execute certain compute jobs more effectively than others
• From a given location, staging jobs to various clusters may require more expensive allocations, more data and network resources, and more use of system services
• Certain compute resources are owned by external organizations and should be utilized sparingly
• Moab allows the use of peer resource affinity to guide jobs to the clusters which make the best fit according to a number of criteria
1
© Cluster Resources, Inc. 151
Management and Troubleshooting
• Peer Management Overview
– Use 'mdiag -R' to view interface health and performance/usage statistics
– Use 'mrmctl' to enable/disable peer interfaces
– Use 'mrmctl -m' to dynamically modify/configure peer interfaces
• Peer Management Overview (contd)
– Use 'mdiag -R' to diagnose general RM interfaces
– Use 'mdiag -S' to diagnose general scheduler health
– Use 'mdiag -R <RMID> --flags=submit-check' to diagnose peer-to-peer job migration
© Cluster Resources, Inc. 152
Sovereignty: Local vs. Centralized Management Policies

• Each admin can manage their own cluster.
• Local users can submit to the local cluster, to specified cluster(s) in the grid, or generically to the grid.
• A portion of each local cluster's resources can be allocated to the grid; the remainder stays under purely local control.

The local admin can apply policies to manage:
1. Local user access to local cluster resources
2. Local user access to grid resources
3. Outside grid user access to local cluster resources (general or specific policies)

The grid administration body can apply policies to manage:
1. General grid policies (sharing, priority, limits, etc.)
© Cluster Resources, Inc. 153
Data Staging
• Data Staging
• Data Staging Models
• Interface Scripts for a Storage Resource Manager
© Cluster Resources, Inc. 154
Data Staging
• Manages intra-cluster and inter-cluster job data staging requirements so as to minimize resource inefficiencies and maximize system utilization
• Prevents the loss of compute resources due to data blocking and can significantly improve cluster performance
© Cluster Resources, Inc. 155
Data Management: Increasing Efficiency

Data staging levels of efficiency and control:

0. No Data Staging.
1. Non-Verified Data Staging: the traditional use of data staging, in which CPU requests and data staging requests are not coordinated, leaving the CPU request to cause blocking on the compute node when the data is not available to process.
2. Verified Data Staging: adds the intelligence for the workload manager to verify that the data has arrived at the needed location prior to launching the job, in order to avoid workload blocking.
3. Prioritized Data Staging: uses the capabilities of Verified Data Staging, but adds the ability to intercept data staging requests and submit them in a priority order matching that of the corresponding jobs.
4. Fully Scheduled Data Staging: uses all of the capabilities of Prioritized Data Staging, but adds the ability to estimate staging periods, allowing workload to be scheduled more intelligently around data staging conditions. Unlike the other levels, which apply only to external storage, this capability can be applied to both external and internal storage scenarios.

Levels 0 and 1 constitute traditional data staging; levels 2 through 4 constitute optimized data staging.
© Cluster Resources, Inc. 156
Optimized Data Staging
• Automatically pre-stages input data and stages back output data with event policies
• Coordinates data stage time with compute resource allocation
• Uses GASS, GridFTP, and scp for data management
• Reserves network resources to guarantee data staging and inter-process communication
(Diagram: in the traditional, inefficient method, the CPU reservation spans prestage, processing, and stage-back, so compute resources are wasted, or "blocked", during data staging. With optimized data staging, the reservation covers only processing, so compute resources are available to other workload during data staging.)
© Cluster Resources, Inc. 157
Efficiencies from Optimized Data Staging
(Diagram: over the same timeframe and processor start times, the traditional, inefficient method completes 4 jobs, while intelligent event-based data staging completes 7.5 jobs, making efficient use of both CPU and network.)
© Cluster Resources, Inc. 158
Data Staging Models

Storage resource manager attributes:

  TYPE             must be NATIVE in all cases
  RESOURCETYPE     must be set to STORAGE in all cases
  SYSTEMQUERYURL   specifies the method of determining file attributes such as size, ownership, etc.
  CLUSTERQUERYURL  specifies the method of determining current and configured storage manager resources such as available disk space, etc.
  SYSTEMMODIFYURL  specifies the method of initiating file creation, file deletion, and data migration

• Verified Data Staging
• Prioritized Data Staging
• Fully-Scheduled Data Staging
• Data Staging to Allocated Nodes
© Cluster Resources, Inc. 159
Verified Data Staging

Verified Data Staging (start the job only after the file is verified to be in the right location) prevents job blocking caused by jobs whose data has not finished staging, in environments where all data staging is controlled by external data managers and no methods exist to control what is staged or in what order:

1. The user submits jobs via a portal or job-script-like mechanism. Data staging needs are communicated to a data manager mechanism (HSM manager, staging tool, script, command, etc.). Job consideration requests are sent to Moab in order to decide how and when to run.
2. Moab periodically queries the storage system (SAN, NAS, storage nodes) to see if the file is "there yet".
3. The data manager moves the data to the desired location when it is able.
4. Moab verifies that the file is "there", then releases the job for submission as long as it satisfies established policies.

Benefits: prevents non-staged jobs from blocking usage of nodes.
Drawbacks: no job-centric prioritization takes place in the order in which data gets staged.
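The verify-then-release loop in steps 2 and 4 can be sketched in a few lines of Python: poll the storage system for the input files and release the job only once they are all present. The job structure and release mechanism here are illustrative, not Moab's API:

```python
# Sketch of the verified-staging check: a held job is released only once
# every input file is confirmed present on the storage system.
import os
import tempfile

def verify_and_release(job, exists=os.path.exists):
    """Return True (and release the job) once all input files are staged."""
    if all(exists(f) for f in job["input_files"]):
        job["state"] = "released"
        return True
    return False

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "input.dat")
    job = {"name": "sim-1", "input_files": [path], "state": "held"}
    print(verify_and_release(job))   # False: data manager has not staged it
    open(path, "w").write("data")    # external data manager stages the file
    print(verify_and_release(job))   # True: job released
```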
© Cluster Resources, Inc. 160
Prioritized Data Staging

Prioritized Data Staging (data staged in priority order) applies when Moab intercepts data staging requests and submits them through a data manager according to priority order:

1. The user submits jobs via a portal or job-script-like mechanism. Data staging needs and job consideration requests are sent to Moab in order to decide how and when to run, and to decide the priority order for submitting data staging requests.
2. Moab evaluates priority, reservations, and other factors, then submits data staging requests (priority jobs first) to a data manager mechanism (HSM manager, staging tool, script, command, etc.) in the order that best matches established policies.
3. Moab periodically queries the storage system (SAN, NAS, storage nodes) to see if the file is "there yet".
4. The data manager moves the data to the desired location when it is able.
5. Moab verifies that the file is "there", then releases the job for submission as long as it satisfies established policies.

Benefits: prevents non-staged jobs from blocking usage of nodes; provides soft prioritization of data staging requests.
Drawbacks: prioritization is only softly provided; insufficient information exists for informed CPU reservations to take place.
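Step 2, submitting staging requests priority jobs first, can be sketched with a priority queue. The data-manager interface here is an illustrative stub, not a real Moab hook:

```python
# Sketch of prioritized staging: sort pending staging requests by job
# priority (highest first) before handing each one to the data manager.
import heapq

def submit_in_priority_order(jobs, data_manager):
    """Hand staging requests to the data manager, highest priority first."""
    heap = [(-job["priority"], i, job) for i, job in enumerate(jobs)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, job = heapq.heappop(heap)
        data_manager(job["input_file"])  # e.g. invoke HSM manager or script
        order.append(job["name"])
    return order

staged = []
jobs = [
    {"name": "low",  "priority": 10,  "input_file": "/data/low.in"},
    {"name": "high", "priority": 500, "input_file": "/data/high.in"},
    {"name": "mid",  "priority": 100, "input_file": "/data/mid.in"},
]
print(submit_in_priority_order(jobs, staged.append))  # ['high', 'mid', 'low']
```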
© Cluster Resources, Inc. 161
Fully Scheduled Data Staging: External Storage

Fully Scheduled Data Staging (priority-ordered, data-staging-centric scheduling) applies when Moab intercepts data staging requests to manage staging order and reserves CPUs and other resources based on estimates of data staging periods:

1. The user submits jobs via a portal or job-script-like mechanism. Data staging needs and job consideration requests are sent to Moab in order to decide how and when to run, and to decide the priority order for submitting data staging requests.
2. Moab evaluates data size and network speeds to estimate data staging duration, then uses this estimate to manage submission of data staging requests and reservations of CPUs and other resources.
3. Moab evaluates priority, reservations, and other factors, then submits data staging requests (priority jobs first) to a data manager mechanism (HSM manager, staging tool, script, command, etc.) in the order that best matches established policies.
4. Moab periodically queries the storage system (SAN, NAS, storage nodes) to see if the file is "there yet".
5. The data manager moves the data to the desired location when it is able.
6. Moab verifies that the file is "there", then releases the job for submission as long as it satisfies established policies.

Benefits: prevents non-staged jobs from blocking usage of nodes; provides soft prioritization of data staging requests; intelligently schedules resources based on data staging information.
Drawbacks: prioritization is only softly provided.
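Step 2, estimating the staging duration and placing the CPU reservation accordingly, can be sketched as simple arithmetic. The safety factor and field names are illustrative assumptions:

```python
# Sketch of data-staging-aware reservation planning: estimate transfer
# time from data size and link speed, pad it, and start the CPU
# reservation when the data is expected to have arrived.

def estimate_staging_seconds(size_gb, link_gbps, overhead=1.2):
    """Transfer time = size / bandwidth, padded by a safety factor."""
    return (size_gb * 8.0) / link_gbps * overhead

def plan_reservation(now, size_gb, link_gbps, walltime_s):
    start = now + estimate_staging_seconds(size_gb, link_gbps)
    return {"start": start, "end": start + walltime_s}

# 100 GB over a 10 Gb/s link: 80 s raw transfer, 96 s with 20% padding.
res = plan_reservation(now=0.0, size_gb=100.0, link_gbps=10.0, walltime_s=3600.0)
print(res)  # {'start': 96.0, 'end': 3696.0}
```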
© Cluster Resources, Inc. 162
Fully Scheduled Data Staging: Local Storage

The model is identical to the external-storage case, except that the storage resides on the local compute nodes. Moab intercepts data staging requests to manage staging order and reserves CPUs and other resources based on estimates of data staging periods:

1. The user submits jobs via a portal or job-script-like mechanism. Data staging needs and job consideration requests are sent to Moab in order to decide how and when to run, and to decide the priority order for submitting data staging requests.
2. Moab evaluates data size and network speeds to estimate data staging duration, then uses this estimate to manage submission of data staging requests and reservations of CPUs and other resources.
3. Moab evaluates priority, reservations, and other factors, then submits data staging requests (priority jobs first) to a data manager mechanism (HSM manager, staging tool, script, command, etc.) in the order that best matches established policies.
4. Moab periodically queries the local compute-node storage to see if the file is "there yet".
5. The data manager moves the data to the desired location when it is able.
6. Moab verifies that the file is "there", then releases the job for submission as long as it satisfies established policies.

Benefits: prevents non-staged jobs from blocking usage of nodes; provides soft prioritization of data staging requests; intelligently reserves resources based on data staging information.
Drawbacks: prioritization is only softly provided.
© Cluster Resources, Inc. 163
Data Staging Diagnostics

• checkjob
  – Stage type: input or output
  – File name: reports destination file only
  – Status: pending, active, or complete
  – File size: size of file to transfer
  – Data transferred: for active transfers, reports number of bytes already transferred

• checknode
  – Active and max storage manager data staging operations
  – Dedicated and max storage manager disk usage
  – File name: reports destination file only
  – Status: pending, active, or complete
  – File size: size of file to transfer
  – Data transferred: for active transfers, reports number of bytes already transferred
© Cluster Resources, Inc. 164
Interface Scripts for a Storage Resource Manager
• Moab's data staging capabilities can utilize up to three different native resource manager interfaces:
  – Cluster Query Interface
  – System Query Interface
  – System Modify Interface
© Cluster Resources, Inc. 165
Prioritized Data Staging Example
#moab.cfg
RMCFG[data] TYPE=NATIVE RESOURCETYPE=STORAGE
RMCFG[data] SYSTEMQUERYURL=exec:///opt/moab/tools/dstage.systemquery.pl
RMCFG[data] CLUSTERQUERYURL=exec:///opt/moab/tools/dstage.clusterquery.pl
RMCFG[data] SYSTEMMODIFYURL=exec:///opt/moab/tools/dstage.systemmodify.pl
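A system-query interface script like the dstage.systemquery.pl referenced above reports file attributes (size, ownership, existence) back to the scheduler. A hedged Python sketch of what such a script might do; the key=value output format is illustrative, and a real script must emit whatever format Moab's native interface actually expects:

```python
# Sketch of a storage system-query script: report basic attributes for
# each requested file, one key=value line per file.
import os
import tempfile

def file_attributes(path):
    """Report basic attributes for one file, or mark it missing."""
    if not os.path.exists(path):
        return {"File": path, "State": "missing"}
    st = os.stat(path)
    return {"File": path, "State": "exists",
            "Size": st.st_size, "UID": st.st_uid}

def report(paths):
    """Emit one whitespace-separated key=value line per queried file."""
    return "\n".join(
        " ".join(f"{k}={v}" for k, v in file_attributes(p).items())
        for p in paths)

if __name__ == "__main__":
    fd, name = tempfile.mkstemp()
    os.write(fd, b"staged data")  # 11 bytes
    os.close(fd)
    print(report([name, "/no/such/file"]))
    os.unlink(name)
```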
© Cluster Resources, Inc. 166
Information Services
• Monitoring performance statistics of multiple independent clusters
• Detecting and diagnosing failures from geographically distributed clusters
• Tracking cluster, storage, network, service, and application resources
• Generating load-balancing and resource state information for users and middleware services
© Cluster Resources, Inc. 167
8. Accounting and Statistics
• Job and System Statistics
• Event Log
• Fairshare Stats
• Client Statistic Reports
• Realtime and Historical Charts with Moab Cluster Manager
• Native Resource Manager
  – GMetrics
  – GEvents
© Cluster Resources, Inc. 168
Accounting Overview
• Job and Reservation Accounting
• Resource Accounting
• Credential Accounting
#moab.cfg
USERCFG[DEFAULT] ENABLEPROFILING=TRUE
© Cluster Resources, Inc. 169
Job and System Statistics
• Determining cumulative cluster performance over a fixed timeframe
• Graphing changes in cluster utilization and responsiveness over time
• Identifying which compute resources are most heavily used
• Charting resource usage distribution among users, groups, projects, and classes
• Determining allocated resources, responsiveness, and/or failure conditions for jobs completed in the past
• Providing real-time statistics updates to external accounting systems
© Cluster Resources, Inc. 170
Event Log
• Reports trace, state, and utilization records at events:
  – Scheduler start, stop, and failure
  – Job create, start, end, cancel, migrate, failure
  – Reservation create, start, stop, failure
• Configurable with RECORDEVENTLIST
• Can be exported to external systems
http://clusterresources.com/moabdocs/a.fparameters.shtml#recordeventlist
http://clusterresources.com/moabdocs/14.2logging.shtml#logevent
© Cluster Resources, Inc. 171
Fairshare Stats

• Provides credential-based usage distributions over time
• Viewed with 'mdiag -f'
• Maintained for all credentials
• Stored in stats/FS.${epochtime}
• Shows detailed time-distribution usage by fairshare metric
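A credential-based usage distribution over time can be sketched as usage accumulated into fixed windows, then reported as each credential's percentage of the window total. This is an illustrative model only, not the algorithm Moab implements:

```python
# Sketch of a fairshare-style usage distribution: bucket per-user usage
# into time windows and report percentage shares per window.
from collections import defaultdict

def usage_distribution(records, window_s=3600):
    """records: (time, user, proc_seconds) -> {window: {user: pct}}"""
    windows = defaultdict(lambda: defaultdict(float))
    for t, user, usage in records:
        windows[int(t // window_s)][user] += usage
    out = {}
    for w, per_user in windows.items():
        total = sum(per_user.values())
        out[w] = {u: 100.0 * v / total for u, v in per_user.items()}
    return out

records = [(0, "alice", 300), (10, "bob", 100), (3700, "alice", 50)]
print(usage_distribution(records))
# {0: {'alice': 75.0, 'bob': 25.0}, 1: {'alice': 100.0}}
```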
© Cluster Resources, Inc. 172
Client Statistic Reports
• In-Memory reports available for nodes and credentials
• Node categorization allows fine-grained localized usage tracking
© Cluster Resources, Inc. 173
Realtime and Historical Charts with Moab Cluster Manager
• Reports nodes and all creds
• Allows arbitrary querying of historical timeframes with arbitrary correlations
© Cluster Resources, Inc. 174
Service Monitoring and Management
© Cluster Resources, Inc. 175
Real-Time Performance & Accounting Analysis
© Cluster Resources, Inc. 176
10. End Users
• Moab Access Portal
• Moab Cluster Manager
• End User Commands
• End User Empowerment
© Cluster Resources, Inc. 177
Moab Access Portal™
• Submit Jobs from a web browser
• View and Modify only your own Workload
• Assists end users in self-managing their workload behaviour
http://clusterresources.com/map
© Cluster Resources, Inc. 178
Moab Cluster Manager™
• Administer Resources and Workload Policies Through an Easy-to-Use Graphical User Interface
• Monitor, Diagnose and Report Resource Allocation and Usage
http://clusterresources.com/mcm
© Cluster Resources, Inc. 179
End User Commands

  canceljob   cancel an existing job
  checkjob    display job state, resource requirements, environment, constraints, credentials, history, allocated resources, and resource utilization
  releaseres  release a user reservation
  setres      create a user reservation
  showbf      show resource availability for jobs with specific resource requirements
  showq       display a detailed, prioritized list of active and idle jobs
  showstart   show the estimated start time of idle jobs
  showstats   show detailed usage statistics for users, groups, and accounts to which the end user has access
© Cluster Resources, Inc. 180
Assist Users in Better Utilizing Resources
• General info
• Job eval
• Completed job failure post-mortem
• Job start time estimates
• Job control
• Reservation control
© Cluster Resources, Inc. 181
Assist Users in Better Utilizing Resources (cont'd)

• How do you evaluate a request?
  – showstart (earliest start, completion time, etc.)
  – showstats -f (general service-level statistics)
  – showstats -u (user statistics)
  – showbf (immediately available resources)