11
Presented by Scalable Systems Software Project Al Geist Computer Science Research Group Computer Science and Mathematics Division Research supported by the Department of Energy’s Office of Science Office of Advanced Scientific Computing Research

Presented by Scalable Systems Software Project Al Geist Computer Science Research Group Computer Science and Mathematics Division Research supported by

  • View
    215

  • Download
    2

Embed Size (px)

Citation preview

Presented by

Scalable Systems Software Project

Al GeistComputer Science Research Group

Computer Science and Mathematics Division

Research supported by the Department of Energy’s Office of ScienceOffice of Advanced Scientific Computing Research

2 Geist_SSS_SC07

www.scidac.org/ScalableSystems

Participating organizations

The teamCoordinator: Al Geist

DOE Laboratories

VendorsNSFSupercomputer

Centers

Open to all, like MPI forum

3 Geist_SSS_SC07

System administrators and managers of terascale computer centers are facing a crisis:

The problem

Computer centers use incompatible,ad hoc sets of systems tools.

Present tools are not designed to scaleto multiteraflop systems – tools must be rewritten.

Commercial solutions are not happening because business forces drive industry toward servers, not high-performance computing.

4 Geist_SSS_SC07

Scope of the effort

Improve productivity of both usersand system administrators

Systembuild and configureJob

management

Systemmonitoring

Resourceand queue

management

Accountingand user

management

Allocationmanagement

Faulttolerance

Security Checkpoint/restart

5 Geist_SSS_SC07

Reduced facilitymanagement costs

More effective use of machines by scientific applications

Impact

Reduce duplication of effort in rewriting components

Reduce need to supportad hoc software

Better systems tools available

Able to get machines up and running faster and keep running

Especially important for LCF

Scalable launch of jobsand checkpoint/restart

Job monitoring and management tools

Allocation management interface

Fundamentally change the way future high-endsystems software is developed and

distributed

6 Geist_SSS_SC07

System software architecture

Testing and

validation

Testing and

validation

User utilitiesUser

utilities

High-performance communication

and I/O

High-performance communication

and I/O

Checkpoint/ restart

Checkpoint/ restart File systemFile systemUsage

reportsUsage reports

Allocation management

Allocation management

User database

User database

Queue managerQueue

managerJob manager and monitor

Job manager and monitor

SchedulerScheduler System monitorSystem monitor

Node configuration

and build manager

Node configuration

and build manager

AccountingAccounting

Access control security manager

(interacts withall components)

Access control security manager

(interacts withall components)

Meta scheduler

Meta scheduler

Metamonitor

Metamonitor

Meta manager

Meta manager

Data migration

Data migration

Application environment

7 Geist_SSS_SC07

Highlights

Designed modular architecture

Allows site to plug and play what it needs

Defined XML interfaces

Independent of language and wire protocol

Reference implementation released

Version 1.0 released at SC2005

Production users ANL, Ames, PNNL, NCSA

Adoption of API Maui (3000 downloads/month)

Moab (Amazon.com, Ford, …)

8 Geist_SSS_SC07

Progress on integrated suiteComponentscan be writtenin any mixtureof C, C++, Java, Perl, and Python.

Standard XMLinterfaces

Node state manager

Node state manager

Meta scheduler

Meta scheduler

Metamonitor

Metamonitor

Meta manager

Meta manager

Metaservices

Service directoryService

directory

Event manager

Event manager

Authentication CommunicationAuthentication Communication

Allocation management

Allocation management

SchedulerScheduler System and job monitorSystem and job monitorAccountingAccounting

Node configuration

and build manager

Node configuration

and build manager

Testing and validation

Testing and validation

Usage reportsUsage reports

Job queue manager

Job queue manager

Process managerProcess manager

Checkpoint/ restart

Checkpoint/ restart

SSS-OSCAR

SSS-OSCAR

Hardware infrastructure

manager

Hardware infrastructure

manager

9 Geist_SSS_SC07

Production users

Running a full suite in productionfor over a year

Argonne National Laboratory:200-node Chiba City and BG/L

Ames Laboratory

Running oneor more components in production

Pacific Northwest National Laboratory: 11.4-TF cluster + others

NCSA

Running a full suiteon development systems

Most participants

Discussions with DOD-HPCMP sites

Use of our schedulerand accounting components

10 Geist_SSS_SC07

Adoption of API

Maui scheduler now uses our API in client and server

3,000 downloads/month.

75 of the top 100 supercomputersin the top 500.

Commercial Moab scheduler usesour API

Amazon.com, Boeing, Ford, Dow Chemical, Lockheed Martin, more…

New capabilities added to schedulers due to API

Fairness, higher system utilization, improved response time.

Discussionwith Cray: Leadership-class computers

Don Mason attended our meetings

Plan to use XML messages to connect their system components.

Exchanged info on XML format, API test software, more … use of our schedulerand accounting components.

11 Geist_SSS_SC07

Contact

Al GeistComputer Science Research GroupComputer Science and Mathematics Division(865) [email protected]

11 Geist_SSS_SC07