CERES AuTomAted job Loading sYSTem (CATALYST)
Test Readiness Review
March 22, 2013
Background & Scope
Background: NPR 7120.7 - Test Readiness Review (TRR) for IT consulted for review structure.
Review Scope: CATALYST – Workflow management tool designed to construct, disposition & submit PGE jobs for CERES data production based on PGE / data range Production Requests (PRs)
TRR Entrance Criteria
1 Objectives of testing clearly defined and documented. All plans, procedures, environment, and configuration of the test system support those objectives.
2 Configuration of system under test is defined and agreed to. All interfaces have been placed under CM or defined in accordance with an agreed-to plan, and a version description document has been made available to TRR participants prior to review.
3 All applicable functional, unit level, subsystem, system, and qualification testing has been conducted successfully.
4 All TRR-specific materials provided to participants prior to review (test plans, test cases, test procedures, etc.).
5 All known system discrepancies have been identified and dispositioned according to agreed-upon plan.
6 All previous design review success criteria and key issues resolved.
7 All required test resources – people, facilities, and other enabling products – have been identified and are available to support testing.
8 Roles and responsibilities of all test participants defined and agreed to.
9 Test contingency planning accomplished and personnel trained.
CATALYST Test Readiness Review
System Description
Development Team: Nelson Hillyer, Joshua Wilkins, Tammy Ayers
CATALYST System Overview
What is CATALYST?
● PGE Execution/Coordination/Logging Framework
  – Implements unique processing flow requirements for CERES
  – Ingests PRs to create a collection of jobs
    ● A job represents a single datadate of a single PGE
    ● Jobs execute on Sun Grid Engine
  – CATALYST jobs wait for predecessor jobs to complete
    ● Predecessor jobs can be:
      – Internal to CATALYST
      – External to CATALYST (via a backlogging interface)
  – CATALYST jobs broadcast completion status to follow-on jobs for rapid follow-on execution
  – CATALYST jobs store execution state long-term in a job logging database
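The wait-then-broadcast behavior described above can be sketched as a small dependency graph. This is a toy model under stated assumptions, not the CATALYST implementation: the class name `JobGraph`, its methods, and the job identifiers are all hypothetical.

```python
from collections import defaultdict


class JobGraph:
    """Toy model: jobs wait on predecessors; completions are broadcast to followers."""

    def __init__(self):
        self.completed = set()              # ids of finished jobs (internal or external)
        self.pending = {}                   # job id -> set of unfinished predecessors
        self.followers = defaultdict(set)   # job id -> jobs waiting on it
        self.ready = []                     # jobs released to run, in release order

    def add_job(self, job_id, predecessors=()):
        unfinished = set(predecessors) - self.completed
        self.pending[job_id] = unfinished
        for pred in unfinished:
            self.followers[pred].add(job_id)
        if not unfinished:
            self.ready.append(job_id)

    def complete(self, job_id):
        """Record a completion and notify every job that was waiting on it."""
        self.completed.add(job_id)
        for follower in sorted(self.followers.pop(job_id, ())):
            self.pending[follower].discard(job_id)
            if not self.pending[follower]:
                self.ready.append(follower)


# Hypothetical identifiers: one internal predecessor and one external one.
graph = JobGraph()
graph.add_job("clouds:2005-01-01")
graph.add_job("inversion:2005-01-01",
              predecessors=["clouds:2005-01-01", "external:upstream"])
graph.complete("external:upstream")   # reported from outside the graph
graph.complete("clouds:2005-01-01")   # the follower is released immediately
```

Because followers are notified at completion time rather than by polling, a follow-on job becomes ready as soon as its last predecessor reports in.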
What is CATALYST? - cont.
● Application Programming Interface (API)
  – Externally accessible XML-RPC API
  – Allows users to inspect and modify job execution status programmatically in any language (with XML-RPC libraries)
  – Authentication handled through LDAP
  – Permissions handled with an Access Control List
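As an illustration of programmatic access over XML-RPC, the sketch below uses Python's standard library. The CATALYST method names are not given in these slides, so `get_job_status` and `delete_job` are hypothetical, and a local stand-in server replaces the real one so the example runs on its own. LDAP authentication and ACL checks are omitted.

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

# Stand-in job table; job id "42" and its state are made up for illustration.
JOBS = {"42": "Waiting"}


def get_job_status(job_id):
    """Hypothetical inspection call."""
    return JOBS[job_id]


def delete_job(job_id):
    """Hypothetical modification call; a deleted job ends in the Completed state."""
    JOBS[job_id] = "Completed"
    return True


# Port 0 asks the operating system for any free port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(get_job_status)
server.register_function(delete_job)
threading.Thread(target=server.serve_forever, daemon=True).start()

host, port = server.server_address
client = xmlrpc.client.ServerProxy(f"http://{host}:{port}/")

before = client.get_job_status("42")
client.delete_job("42")
after = client.get_job_status("42")

server.shutdown()
server.server_close()
print(before, "->", after)
```

Any language with an XML-RPC library could make the same two calls; only the client half of this sketch would change.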
CATALYST Implementation
● CATALYST Server
  – Job controller
  – XML-RPC server
  – Written in Perl and C
  – Multi-threaded
● Operator's Console
  – Graphical User Interface
  – Communicates with the CATALYST Server using the XML-RPC protocol
  – Written in Java
  – Multi-threaded
CATALYST Server Components
CATALYST Server Components – XML-RPC Front-End
● External facing programming interface
● Multi-threaded to handle concurrent client connections
● Customized version of RPC::XML from Perl CPAN
– Switched to use threads, instead of forking
● Verifies users using the User Manager
● Production Requests enter through here from:
– PR Web Application
– Standalone Test PR Submission Applications (included in this delivery)
CATALYST Server Components – User Manager
● Authenticates AMI users against AMI LDAP server
● Maintains an ACL that grants AMI users CATALYST operation permissions
CATALYST Server Components – PGE Handlers Pool
● Collection of job generators tailored for each PGE
● Follows Factory Pattern
● Inputs – Production Request
● Outputs – List of PGE jobs
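A minimal sketch of the Factory Pattern as applied here: a registry maps a PGE name to a handler that expands a Production Request into jobs, one per datadate. The handler classes, the daily/monthly cadences, and the PGE names `example-daily-pge` and `example-monthly-pge` are assumptions for illustration, not taken from the CATALYST code.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ProductionRequest:
    pge: str
    start: date
    end: date  # inclusive


@dataclass(frozen=True)
class Job:
    pge: str
    datadate: date


class DailyHandler:
    """Generates one job per calendar day in the PR's data range."""

    def __init__(self, pge):
        self.pge = pge

    def jobs_for(self, pr):
        days = (pr.end - pr.start).days + 1
        return [Job(self.pge, pr.start + timedelta(days=i)) for i in range(days)]


class MonthlyHandler:
    """Generates one job per month touched by the PR, dated the 1st of the month."""

    def __init__(self, pge):
        self.pge = pge

    def jobs_for(self, pr):
        jobs, year, month = [], pr.start.year, pr.start.month
        while (year, month) <= (pr.end.year, pr.end.month):
            jobs.append(Job(self.pge, date(year, month, 1)))
            year, month = (year + 1, 1) if month == 12 else (year, month + 1)
        return jobs


# Registry of hypothetical PGE names to the handler class tailored for each.
HANDLERS = {
    "example-daily-pge": DailyHandler,
    "example-monthly-pge": MonthlyHandler,
}


def handler_for(pge):
    """Factory: return the job generator registered for this PGE."""
    return HANDLERS[pge](pge)


daily_pr = ProductionRequest("example-daily-pge", date(2004, 12, 30), date(2005, 1, 2))
monthly_pr = ProductionRequest("example-monthly-pge", date(2004, 12, 15), date(2005, 2, 10))
daily_jobs = handler_for(daily_pr.pge).jobs_for(daily_pr)
monthly_jobs = handler_for(monthly_pr.pge).jobs_for(monthly_pr)
```

The caller never names a handler class directly, so supporting a new PGE only requires registering another handler.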
CATALYST Server Components – POSIX_SYSCALL Pool
● Invokes child processes
● Used to prevent environment variable pollution amongst AJSS and ANGe Epilog calls
● Returns exit codes of child processes to the caller (if necessary)
● Multi-threaded with 8 primary execution slots
  – To maximize job launching
  – To minimize overloading
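The idea of running each call in a child process with its own environment, through a fixed number of slots, can be sketched as follows. This is an illustration in Python rather than the server's Perl and C. The variable `EXAMPLE_EXIT` and the child command are made up; only the slot count of 8 comes from the slide.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

MAX_SLOTS = 8  # the 8 primary execution slots named on the slide


def run_isolated(command, extra_env):
    """Run a command in a child process with a private copy of the environment.

    The parent's os.environ is never modified, so variables set for one call
    cannot leak into the next one. Returns the child's exit code.
    """
    env = dict(os.environ, **extra_env)
    completed = subprocess.run(
        command, env=env, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return completed.returncode


# Hypothetical child: exits with the code found in its EXAMPLE_EXIT variable.
child = [sys.executable, "-c",
         "import os, sys; sys.exit(int(os.environ['EXAMPLE_EXIT']))"]
per_call_env = [{"EXAMPLE_EXIT": str(code)} for code in (0, 3, 0, 7)]

with ThreadPoolExecutor(max_workers=MAX_SLOTS) as pool:
    exit_codes = list(pool.map(run_isolated, [child] * len(per_call_env), per_call_env))
```

Capping the worker count bounds how many children launch at once, which is the trade-off the slide describes between launch throughput and overloading.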
CATALYST Server Components – Logging Database Interface
● Handles insertion, modification, and querying of CATALYST job objects
● Translates CATALYST job objects into a series of SQL queries
● Results returned as CATALYST job objects
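A rough sketch of such an interface, translating job objects to SQL and back. The production logging database is PostgreSQL; an in-memory SQLite database stands in here so the example is self-contained. The `jobs` table layout, the `Job` fields, and the `JobLog` class are assumptions.

```python
import sqlite3
from dataclasses import astuple, dataclass


@dataclass
class Job:
    job_id: int
    pge: str
    datadate: str
    state: str


class JobLog:
    """Accepts and returns Job objects; callers never see SQL or raw rows."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS jobs "
            "(job_id INTEGER PRIMARY KEY, pge TEXT, datadate TEXT, state TEXT)"
        )

    def insert(self, job):
        self.conn.execute("INSERT INTO jobs VALUES (?, ?, ?, ?)", astuple(job))

    def set_state(self, job_id, state):
        self.conn.execute("UPDATE jobs SET state = ? WHERE job_id = ?", (state, job_id))

    def query(self, state):
        rows = self.conn.execute(
            "SELECT job_id, pge, datadate, state FROM jobs "
            "WHERE state = ? ORDER BY job_id",
            (state,),
        )
        return [Job(*row) for row in rows]


log = JobLog(sqlite3.connect(":memory:"))
log.insert(Job(1, "example-pge", "2005-01-01", "Submitted"))
log.insert(Job(2, "example-pge", "2005-01-02", "Submitted"))
log.set_state(1, "Executing")
executing = log.query("Executing")
submitted = log.query("Submitted")
```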
CATALYST Server Components – Cluster Resource Monitor
● Assembles SGE and Ganglia information for cluster nodes
● Presently used for presentation in the CATALYST Operator's Console
● Future use will be for dynamic load balancing
  – Help with running I/O-intensive PGEs
CATALYST Server Components – DRMAA Interface
● Submits CATALYST jobs to Sun Grid Engine using the Distributed Resource Management Application API (DRMAA)
● Applies:
  – [Re]starting
  – Pausing/Resuming
● Collects:
  – Resource usage data
  – Exit status
CATALYST Server Components – CATALYST Core
● Ingests, monitors, executes, and deletes CATALYST jobs
  – Leveraging previously mentioned components
● Broadcasts job completion to follow-on PGEs/PRs
● Job data paged in and out of memory to SQLite3 files on disk
  – Reduces memory footprint
  – Resumes to state prior to a shutdown
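The paging behavior can be illustrated with a small store that moves job data between an in-memory cache and a SQLite3 file, then reopens the file to mimic a restart. The `JobStore` class, the JSON encoding, and the job fields are assumptions made for this sketch. A missing job id is not handled.

```python
import json
import os
import sqlite3
import tempfile


class JobStore:
    """Keeps active jobs in memory and pages the rest out to a SQLite3 file."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS jobs (job_id TEXT PRIMARY KEY, body TEXT)"
        )
        self.cache = {}

    def page_out(self, job_id):
        """Write the job to disk and drop it from memory."""
        body = json.dumps(self.cache.pop(job_id))
        with self.conn:  # commits on success
            self.conn.execute("INSERT OR REPLACE INTO jobs VALUES (?, ?)", (job_id, body))

    def page_in(self, job_id):
        """Load the job from disk if it is not already in memory."""
        if job_id not in self.cache:
            row = self.conn.execute(
                "SELECT body FROM jobs WHERE job_id = ?", (job_id,)
            ).fetchone()
            self.cache[job_id] = json.loads(row[0])
        return self.cache[job_id]

    def close(self):
        self.conn.close()


path = os.path.join(tempfile.mkdtemp(), "jobs.sqlite3")

store = JobStore(path)
store.cache["job-1"] = {"pge": "example-pge", "state": "Waiting"}
store.page_out("job-1")
in_memory_after_page_out = dict(store.cache)
store.close()                      # simulate a shutdown

restarted = JobStore(path)         # simulate a restart against the same file
restored = restarted.page_in("job-1")
restarted.close()
```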
CATALYST Operator's Console
1. Status bar and menu
2. PR list
3. PR status
4. PR chunk details
5. Job navigation
6. Job table
7. Job actions
8. Blade list
9. Log
CATALYST Operator's Console – Service Components
● SSH Tunnel:
– Uses the JCraft SSH2 library
– Establishes an SSH tunnel to the selected host; users log in with LDAP SSH credentials
– All traffic from the console to server is routed over this tunnel
● XML-RPC Handler:
– Uses a custom-modified version of the Apache XML-RPC library, with timeouts added to RPC requests
– Manages all RPC requests and callbacks
● PR Handler:
– Manages PR and job information asynchronously from the User Interface and notifies the UI components when data has been retrieved
● Blade Handler:
– Similar to PR Handler but with blade information
● Preferences Handler:
– Manages loading and saving of user preferences such as ports
● TextArea Handler:
– Handles appending log data from the internal logger to the log panel in the UI
CATALYST Job Life-Cycle
● Uninitialized – Job object created from PR, stored on disk, not yet recorded to logging database
● Submitted – Job is now recorded to logging database and has a CATALYST Job ID
● Waiting – Job is waiting for notification of predecessor completion
● Scheduled – Job is ready to run
● Executing – Job is actively running
● Completed – Job has finished, or has been deleted
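The life-cycle can be written down as a small state machine. This sketch assumes jobs pass through the states in the order listed, which the slide implies but does not state. Deletion jumps straight to Completed, per the last bullet. The function names are hypothetical.

```python
from enum import Enum


class JobState(Enum):
    UNINITIALIZED = "Uninitialized"
    SUBMITTED = "Submitted"
    WAITING = "Waiting"
    SCHEDULED = "Scheduled"
    EXECUTING = "Executing"
    COMPLETED = "Completed"


_ORDER = list(JobState)  # Enum members iterate in definition order


def advance(state):
    """Move to the next state in the listed order (an assumption of this sketch)."""
    if state is JobState.COMPLETED:
        raise ValueError("a completed job has no further state")
    return _ORDER[_ORDER.index(state) + 1]


def delete(state):
    """Deleting a job at any point ends it in the Completed state."""
    return JobState.COMPLETED


state = JobState.UNINITIALIZED
history = [state]
while state is not JobState.COMPLETED:
    state = advance(state)
    history.append(state)
```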
Configuration Management
● Subversion used throughout development
● CATALYST code deliveries closely follow the existing delivery strategy
  – Major difference is the use of a self-extracting installer to handle code, database, and configuration file setup
  – All future code/database deliveries or patches will also be done using an installer
CATALYST Pre-Release Testing
Pre-Release Development/CM Testing
● CM Testing conducted with development team
  – Bug fixes implemented in conjunction with CM testing
● Initial Aqua-only January 2005 Clouds/Inversion Test
  – Aqua-only due to PGE/PR datadate interleaving bug with completion broadcast
  – Initially submitted 3 days out of PRs
  – Appended 7 more days to active PRs
  – Appended 10 more days to active PRs
● Full Month of Concurrent Terra/Aqua Clouds/Inversion
● Fresh Install Full Month of Concurrent Terra/Aqua
  – January 2005
  – End-to-end Clouds and Inversion exited successfully
Known System Bugs
● CATALYST Server
  – Gradual memory leak identified in CPAN's Schedule::DRMAAc C/Perl binding (fix in progress, ~1 week)
    ● Will not interrupt testing of PGEs under CATALYST for included Test PR datadate ranges
● Operator's Console
  – Minor known graphical bugs (such as some blade names in the blade list not displaying fully)
    ● Will not interrupt testing of PGEs under CATALYST
Dependencies
External Dependencies
● Required to Run:
  – DMT Maintained:
    ● Perl 5.16.1 and Perl Libraries
    ● AMI Job Submission Scripts (per PGE)
  – ASDC Maintained:
    ● ANGe Epilogs
    ● Sun (Univa) Grid Engine (ops.q)
    ● PostgreSQL Server (dsrvr205)
    ● LDAP Server (ab01.cluster.net)
● Optional:
  – Ganglia
    ● Nice to have, will work without
  – PR Web Application
    ● Could use PR submission scripts similar to included test cases
Risks
Risks (1 of 2)
● Memory leak previously mentioned – minor risk for testing
  – Fix planned well before being placed in production
● Epilog interface not tested – minor risk
  – Unable to test in development or CERES CM environments
  – Minor code delivery to use pending SGE dedicated archive queue:
    ● Will be written in a standalone script called through POSIX_SYSCALL interface
    ● Will use dedicated server for archival
Risks (2 of 2)
● Shared Resources – minor risk
  – Could be some resource competition (SGE slots) between non-CATALYST and CATALYST production and testing
  – CATALYST currently has an artificial limit of 128 concurrently running SGE jobs
    ● Plenty for running concurrent Terra and Aqua Clouds and Inversion
  – No crosstalk amongst CATALYST and non-CATALYST code
    ● Separate Sampling Strategies, Configuration Codes, and $CERESHOMEs will keep this from happening
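One way to picture the 128-job cap is an admission throttle that queues anything beyond the limit and releases queued jobs as running ones finish. How CATALYST actually enforces its limit is not described in these slides; the `Throttle` class and the job ids are hypothetical.

```python
from collections import deque


class Throttle:
    """Admit at most `limit` jobs at a time; queue the rest in arrival order."""

    def __init__(self, limit=128):
        self.limit = limit
        self.running = set()
        self.queued = deque()

    def submit(self, job_id):
        if len(self.running) < self.limit:
            self.running.add(job_id)
        else:
            self.queued.append(job_id)

    def finish(self, job_id):
        """Free a slot and hand it to the oldest queued job, if any."""
        self.running.remove(job_id)
        if self.queued:
            self.running.add(self.queued.popleft())


throttle = Throttle(limit=128)
for n in range(130):
    throttle.submit(f"job-{n}")

running_before = len(throttle.running)
queued_before = len(throttle.queued)
throttle.finish("job-0")  # frees one slot, so the oldest queued job starts
```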
CATALYST Test Readiness Review
PPE Testing
PPE Test Approach (1 of 2)
• Phase 1: Test cases based on System Requirements defined in the CATALYST Requirements Document baselined 03/26/12 (Functional Testing)
  o Incorrect or missing functionality
  o Interface errors
  o Behavior or performance errors
  o Initialization and termination errors
  o Exploratory testing
  o Submission of 4 months of data covering February, year, and month boundaries
  NOTE: The general concept is to determine whether the code performs according to documented requirements
• Phase 2: Operational Scenarios (e.g., day-in-life, proof of concept)
  o Test execution with CATALYST and non-CATALYST PGEs
PPE Test Approach (2 of 2)
• Phase 3: PGE input and output verification based on existing CERES documentation
• Phase 4: End-To-End Testing
  o PRDB and Epilogue Interface Testing
PPE Test Assumptions
• All Production Requests (PRs) associated with PPE Testing are entered in the PPE PR database and are accurate
• All Clouds and Inversion PGEs for PPE Testing have been certified for production
• Subsystem Operator’s Manuals associated with PGEs are up to date
• Staff members have been trained with Operator’s Console
• Epilogue scripts are operational
• The DPO is accessible
• PPE PRDB is operational
Overview of Test Components
• The ability to accept electronic production requests
• The ability to interpret environment variables and special parameters for all instances as defined in the production request (PR) and Subsystem Operator’s Manual
• The ability to gracefully shut down
• The ability to restart and restore from the state recorded prior to shutdown
• The ability to submit CERES PGEs to a job scheduler
• The ability to determine PGE preconditions prior to job submissions
• The ability to manage/throttle jobs
• The ability to track all job instances
• The ability to log results of job submissions
• The ability to interface with the CERES Epilogue scripts
• Compliance with NASA LaRC IT security requirements
• Access Controls
• Usability of Operator’s Console
Features NOT To Be Tested
• Production Request Database
• CERES Epilogue
During the final stages of PPE testing, an End-To-End Test of the interface of CATALYST with the PRDB and Epilogue will be conducted to ensure operability of the production system.
Roles and Responsibilities
Role/Responsibility: Contact
• Oversight: Lindsay Parker
• Progress Reporting: ASDC – Tonya Davenport, Lindsay Parker; CATALYST – Nelson Hillyer and Josh Wilkins
• Test Plan Development: Tonya Davenport, Tammy Ayers
• Test Execution: Operations Staff – Reynold Byrd, Lead; SIT Staff – Angel Cross, Sharon Dukes-Allen, Vertley Hopson, Tonya Davenport, Lead
• CATALYST Development Support: Nelson Hillyer and Josh Wilkins
• CM Support: CERES – Tammy Ayers; ASDC – Sharon Dukes-Allen, Tonya Davenport
• Environment Configuration Support / System Monitoring: ASDC SA Staff
• Database Support: Maria Wilson
• PRDB Content Support: Lisa Coleman
• CERES Epilogue Support: Don Rieger
• Metrics Support: Nelson Hillyer and Josh Wilkins
• PPE Test Status Reporting: Tonya Davenport, Lindsay Parker
• PGE Output Verification: Clouds and Inversion Subsystem Representatives
• JIRA Support: Crystal Freebourn
PPE Environment – Science S/W
• Machine:
  • PPE testing currently done on ami-p (ab01) (Production System)
• CERES current production software installed and verified in PPE
  • Clouds (809, 918, 924, 943, 950)
    o CER4.1-4.1P6
    o CER4.1-4.2P5
    o CER4.1-4.2P4
    o CER4.1-4.3P3
  • Inversion (814, 912, 913, 939, 948)
    o CER4.5-6.1P4
    o CER4.5-6.1P5
    o CER4.5-6.2P3
    o CER4.5-6.4P2
  • Perl_lib (SCCR951)
  • CERESlib (SCCR928)
Issue / Defect Reporting (1 of 2)
• JIRA will serve as the primary tool for reporting
• Project: CERES-ASDC (CER)
• JIRA accounts established on 02/05/13:
  • All SIT members and Operations Staff
  • Tammy Ayers
  • Jonathan Gleason
  • Nelson Hillyer
  • Walt Miller
  • Pamela Rinsland
  • Josh Wilkins
• JIRA training for CATALYST users held on 02/05/2013
Issue / Defect Reporting (2 of 2)
• SIT Team will submit JIRA reports using the Test Discrepancy Report (TDR) workflow during PPE Testing
• TDRs will be assessed by the CATALYST development team and recommendations entered into the JIRA report
• Major/Critical Problems will be escalated to the entire CATALYST Team
• Software updates will be received via the agreed CM Process
JIRA Workflow Overview: Test Discrepancy Report (TDR)
Test Reporting
• Provide Weekly Status Report to include:
  • Issues and Accomplishments
  • Next Week Priorities
  • Test Metrics
• At the conclusion of PPE Testing:
  • Test Summary Report
  • Test Execution Results
  • Test Traceability Matrix
  • Test Cases
Test Status Reporting: Test Case Template (1 of 2)
Test Status Reporting: Test Case Template (2 of 2)
CATALYST Weekly Summary Report (1 of 3)
CATALYST Weekly Summary Report (2 of 3)
CATALYST Weekly Summary Report (3 of 3)
Test Resources
• 4 SIT Testers
• 2 Operators
• Clouds & Inversion Subsystem Representative
• Resource constraints
  • Angel Cross – Testing the CERES Clouds delivery (SCCR 947) Beta2-Edition4
  • Tester Jury Duty (March)
TRR Success Criteria
1 Adequate test plans are completed and approved for system under test.
2 Adequate identification and coordination of required test resources is complete.
3 Previous component, subsystem, and system test results form satisfactory basis for proceeding into planned tests.
4 Risk level is identified and accepted by program leadership, as required.
5 Plan to capture any lessons learned from test program is established.
6 The objectives of the testing have been clearly defined and documented, and the review of all the test plans, procedures, environment and configuration of test item provide a reasonable expectation that test objectives will be met.
7 Test cases have been reviewed and analyzed for expected results, and the results are consistent with the test plan and objectives.
8 Test personnel have received appropriate training.
TRR Recommendation
• CATALYST pre-delivery software was installed in the PPE environment on ab01 on 03/15/13
• Initial Acceptance Test Cases submitted by 3 of 4 testers
• Test Plan and Operations Manual received
• Adequate Test Resources are available (3 of 4 testers, 2 operators)
• Risks identified are acceptable for Build 1 PPE Testing
Therefore, CATALYST is ready for formal testing in PPE