A View from the Top End of Year 1 Al Geist October 10-11 Houston TX


Al Geist

October 10-11

Houston TX


Main Web Site: www.scidac.org/ScalableSystems

Coordinator: Al Geist

Participating Organizations

ORNL, ANL, LBNL, PNNL, PSC, SDSC, IBM, SNL, LANL, Ames, NCSA, Cray, Intel, Unlimited Scale


Scalable Systems Software Center

June 13-14

Houston TX

Review of Last Meeting

Details in Main project notebook


Progress Reports at June mtg

Al Geist – working groups, notebooks, telecoms

Working Group Leaders –
• What areas their working group is addressing
• Progress report on what their group has done
• Present problems being addressed
• Next steps for the group
• Discussion items for the larger group to consider

Demonstrations of Prototype Components
One Big intra-component demo

Slides can be found in Main Notebook page 22


Consensus and Voting:

Event Manager Proposal: After much discussion, the proposal was revised to say that Event Management is an important feature of our Software Suite, independent of whether it is in a central component or inside components, and that the proposed tuple API is the initial starting point.
Passed strawvote 13 for / 0 against / 0 abstain

Adopt HTTP POST (byte count) as standard Proposal:
Passed strawvote 10 for / 0 against / 1 abstain

Adopt W3 standard for XML signature syntax and process: Long discussion. Decided more discussion needed before vote.

Bugzilla site now up and running. Link is on the ScalableSystems home page.
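One way to read the "HTTP POST (byte count)" proposal is that each message travels as the body of an HTTP POST whose Content-Length header carries the byte count, so the receiver knows exactly how many bytes to read. The sketch below illustrates that idea only. The `/SSS` path, the XML payload and both function names are hypothetical and are not taken from the project's spec.

```python
def frame_message(payload: bytes, path: str = "/SSS") -> bytes:
    """Wrap a payload in an HTTP POST whose Content-Length is the byte count."""
    header = (
        f"POST {path} HTTP/1.0\r\n"
        "Content-Type: text/xml\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("ascii")
    return header + payload


def unframe_message(data: bytes) -> bytes:
    """Recover the payload by reading exactly Content-Length bytes of body."""
    head, _, rest = data.partition(b"\r\n\r\n")
    length = None
    for line in head.split(b"\r\n")[1:]:  # skip the request line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("missing Content-Length header")
    if len(rest) < length:
        raise ValueError("truncated message")
    return rest[:length]


# Hypothetical payload; the real message schema is defined by the working groups.
message = b"<Request action='Query' object='Job'/>"
framed = frame_message(message)
assert unframe_message(framed) == message
```

Because the receiver trusts the byte count rather than a delimiter, the payload can contain any bytes, including blank lines.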


Scalable Systems Software Center

June-October

Progress Since Last Meeting


Five Project Notebooks filling up

A main notebook for general information

And individual notebooks for each working group

• Over 200 total pages – 34 added since last meeting

• A lot of new material in Resource Management notebook (way to go)

Get to all notebooks through main web site www.scidac.org/ScalableSystems

Click on side bar or at “project notebooks” at bottom of page


Four Bi-weekly Working Group Telecoms

Less talk more work

Resource management, scheduling, and accounting

Tuesday 3:00 pm (Eastern) 1-800-664-0771 keyword “SSS mtg”

Validation and Testing

Wednesday 1:00 pm (Eastern) 1-877-540-9892 mtg code 999157

Process management, system monitoring, and checkpointing

Thursday 1:00 pm (Eastern) 1-877-252-5250 mtg code 160910

Node build, configuration, and information service

Friday 3:00 pm (Eastern) 1-888-469-1934 mtg code 58145 (changes)


Scalable Systems Integrated Component Demonstration

Diagram components: Job Submission Client, Queue Manager, Allocation Manager, Node Monitor, Local Scheduler, Process Manager, Discovery Service

Color Key (Working Group): Resource Management and Accounting; Process Management and Monitoring; Node Configuration and Build Infrastructure

Numbered messages in the diagram:
0 Service-Lookup
1 Submit-Job
2 Query-Job
3 Query-Node
4 Create-Reservation
5 Run-Job
6 Exec-Process
7 Query-Job
8 Delete-Job
9 Withdraw-Allocation

Done June 2002


Diagram components and leads:
• Authentication & Communication: R. Lusk
• Meta Scheduler: D. Jackson
• Meta Manager: S. Scott
• Accounting: S. Jackson
• Scheduler: D. Jackson
• System/Job Monitors: M. Showerman
• Package Services: J. Mugler
• Information Services: JP Navarro
• Allocation Management: S. Jackson
• Queue Manager: B. Bode
• Job Manager: B. Bode
• Checkpoint/Restart: P. Hargrove
• Process Manager: R. Lusk
• Service Directory: N. Desai
• Node Manager: T. Naughton
• C-Plant XML interface: E. Debenedictis
• SSSlib: used by all components

Color Key: Resource Mgmt Working Group; Build & Configure Working Group; Process Mgmt Working Group


Scalable Systems Software Center

October 10-11, 2002

This Meeting


SciDAC Booth


SciDAC Systems Poster


SciDAC Booth


SciDAC Systems Poster (2)


Agenda – October 10

8:00 Breakfast
8:30 Al Geist – Project Status. Getting ready for SC 2002
9:00 External Project review – February (start planning)
Working Group Reports
9:30 Scott Jackson – Resource Management
10:30 Break
11:00 Erik Debenedictis – Validation and Testing
12:00 Lunch (on own but go somewhere as group)
1:00 Paul Hargrove – Process Management
2:00 Narayan Desai – Node Build, Configure
3:00 Break
3:30 SC Demos and Hacking: big multi-component demo
5:00 Open Discussion
5:30 Adjourn
Working groups may wish to get together in evening


Agenda – October 11

8:00 Breakfast
8:30 Discussion, proposals, strawvotes
THANKS to Airport Security Meeting for open access to their internet access!
• ssslib
• meatball GUI (who?)
• Chiba City for SC demos (Nov 4?)
• cross group issues
• test packaging?
10:30 Break
11:00 Al Geist – Summary
• SC Booth, demos, theater, software, handout (Brett)
• February review – reviewers, advisor, talks
• next meeting date: day before review
12:00 meeting ends


External SciDAC Review mtg

Late February 2003 – may bubble over to early March
18 month checkup by MICS

Each SciDAC Project is reviewed separately – Scalable Systems is the only thing on the agenda

Full two days of detailed presentations
So many of us will have to give presentations

External review panel (different for each ISIC)
• We can suggest names
• Can’t be from our organizations or affiliated
• They will have been given our proposal beforehand


External SciDAC Review metrics

I asked Fred and McGraw about Metrics:

1. How have we helped SciDAC Apps? Can we show use in CCS and NERSC and others?

2. Put Advisory Panel into place. Apps and Computer Center personnel. I’ve asked Drake (Climate), Mezzacappa (Astro), Bland (CCS), Nichols (Chemistry). We need NERSC rep and others?

3. Show short term successes and use


External Review Panel Suggestions

External review panel (different for each ISIC)
We can suggest names - who?

• Barney McCabe
• Russ Miller
• Bart Miller
• Jose M (IBM)
• Someone from Cray
• Someone from Etnus – John Delsignore
• Someone from Unlimited Scale?
• Walt Ligon
• Andrew Lumsdaine
• Jim Garlick
• Steve Chapin


Meeting Notes

Scott Jackson – rm progress
Scope: queue manager, job manager, scheduler, allocation, & meta
Demo: CCS, NERSC, and Chiba meta-schedule would be good
Scheduler:
• enhance internal scalability to 64K nodes
• add support for HTTP framing protocol
• Qbank security enhanced
• interface to PBS, LSF, LL for suspend/resume and requeue mgt
Queue Manager:
• conforms to SSSRMAP XML spec
• full wire protocol compatibility
• new interface to Event Manager
Allocation Manager:
• survey of 15 sites for requirements
• implemented HTTP framing, SHA1-HMAC security
• working with Qbank/Maui
• reframed bank objects (accounts, users, allocations) as dynamic object actions defined in metadata cache creation of dynamic web-GUI using PHP and javascript
Meta scheduler:
• interoperates with Grid (globus)
• fault tolerance – global jobID tracking, scheduler reconnection
• improved user interface
Current issues: job state mgt, data staging, job signaling, job steps
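The notes mention SHA1-HMAC security on top of HTTP framing. As a rough illustration of what HMAC-SHA1 signing and verification look like, here is a minimal sketch using Python's standard `hmac` and `hashlib` modules. The shared key, the XML payload and the way the digest would be attached to a message are all assumptions made for this example. They are not the SSSRMAP wire format.

```python
import hashlib
import hmac


def sign(payload: bytes, key: bytes) -> str:
    """Return the hex HMAC-SHA1 digest of the payload under a shared key."""
    return hmac.new(key, payload, hashlib.sha1).hexdigest()


def verify(payload: bytes, key: bytes, digest: str) -> bool:
    """Recompute the digest and compare in constant time."""
    return hmac.compare_digest(sign(payload, key), digest)


key = b"shared-secret"  # hypothetical key shared by two components
body = b"<Request action='Query' object='Account'/>"  # hypothetical payload
tag = sign(body, key)

assert verify(body, key, tag)
assert not verify(body + b" ", key, tag)  # any change to the body invalidates the tag
```

A receiver that holds the same key can reject messages that were altered in transit or signed by a party without the key.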


Meeting Notes

Scott Jackson – rm progress (cont.)
Next work:
• prepare for SC demos, scalability testing
• BIG thing is release v1.0 RM system
• documentation, security authentication
• extend suspend/resume schema beyond what PBS, LL does today
Discussion of the need for a scalability testbed.

Erik Debenedictis – validation progress
• Create machine independent test for testing supercomputer
• Infrastructure – QMTest
• Tests (from all sources)
• Value – improved method to execute the “SSS Standard Test body”
• Recent Activity – QMTest on SNL SciDAC cluster, test package definition

Will McClendon – test architecture (diagram in slides)
• QMTest is scriptable test driver in Python
• HTTP based interface – Zope
• Running at SNL and PSC
• Requires exact match on STDOUT/STDERR
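To make the "exact match on STDOUT/STDERR" requirement concrete, the sketch below runs a command and passes only if both output streams equal the expected text exactly. It uses Python's standard `subprocess` module and does not use QMTest's own API. The function name and the test command are hypothetical.

```python
import subprocess
import sys


def run_exact_match(argv, expected_stdout, expected_stderr=""):
    """Run a command and pass only if both streams match the expected text exactly."""
    result = subprocess.run(argv, capture_output=True, text=True)
    return (result.stdout == expected_stdout
            and result.stderr == expected_stderr)


# Hypothetical test command: the current interpreter printing one line.
cmd = [sys.executable, "-c", "print('node42 up')"]

print(run_exact_match(cmd, "node42 up\n"))  # expectation includes the trailing newline
print(run_exact_match(cmd, "node42 up"))    # expectation omits the trailing newline
```

Exact matching is simple to implement but brittle: a timestamp or hostname in the output breaks the comparison, which fits the later note that raw results need to be interpreted to determine pass or fail.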


Meeting Notes

Will McClendon – test architecture (cont.)
• QMTest screenshot and discussion of how tests are done
• Raw results need to be interpreted to determine pass or fail

Mike ??? – goes over the “package” details
• How to create a test package to the suite – Package File Layout, Make-like
• Will present as a proposal tomorrow

Paul Hargrove – pm group
Progress:
• prototyping and development continue
• how to interface to something we can’t imagine
• validating schema for process manager
• node monitor schema created
Checkpoint Manager – types:
• serial checkpoints (independent but potentially multithreaded), done
• parallel checkpoints (MPI)
• scalable systems XML interfaces


Meeting Notes

Rusty Lusk – process manager (see diagram in his slides)
• MPD1 (C) overview – added capabilities required by pmWG
• MPD is one prototype for SSS Process Manager
• MPD2 (python) – diagram in slides for new design
• Python about 5X slower with this untuned version

Mike Showerman – system monitoring component
• Craig Steffen full time on this project and a student
• Using new XML schema defined by
• Need to write graphical display that uses this new XML interface
• Run a small cluster in NCSA booth with SSS software stack
• Discussion – how about an animated meatball diagram

Paul returns –
• Data migration meatball removed
• Next steps – interfaces continue to stabilize: chkpt, PM, monitors
• Monitoring data... Details need defining


Meeting Notes

Narayan Desai – Build and configure update
Components:
• service directory (solid and on Chiba now)
• event manager completely rewritten, stable XML
• SSSlib robust (bindings for C++, Java, Python, Perl) (wire protocol modules: basic, challenge, http, http-rm)
Build and Config Management (third try at the abstraction):
• cluster HW build system (OSCAR module for this one in the works)
• node state manager
Issues: Abstraction problems with second try. Multiple implementations important to validate abstraction

DEMOS


Diagram components: Meta Scheduler, Meta Monitor, Meta Manager, Accounting, Scheduler, Node Configuration & Build Manager, User DB, Allocation Management, Job Queue Manager, Process Manager, Usage Reports, User Utilities, High Performance Communication & I/O, File System, Application Environment, Meta Services, Testing & Validation, System & Job Monitor, Event Manager, Service Directory, Checkpoint/Restart

Blue text – uses ssslib
Red text – talks ssslib protocol

Refined Picture on Next Slide


Diagram components: Accounting, File System, Event Manager, Service Directory, Meta Scheduler, Meta Monitor, Meta Manager, Scheduler, User DB, Allocation Management, Process Manager, Usage Reports, User Utilities, High Performance Communication & I/O, Application Environment, Meta Services, System & Job Monitor, Checkpoint/Restart, Grid Interfaces, Job Queue Manager, Node Configuration & Build Manager

Callout: These interface to all