HPC Systems Engineering in the Interaction Room
Matthias Book
with Morris Riedel, Jülich Supercomputing Centre / UoI, and Helmut Neukirchen, University of Iceland
General Software Engineering Challenges
Need to ensure we are building the right software, and we are building it right
But many sources of miscommunication between domain & technology experts:
Different vocabulary / areas of competence
Struggling to convey / understand requirements precisely
Struggling to realize what is non-obvious / implicit knowledge / unknown
Struggling to realize what bears particular value / effort / risk
Struggling to convey / understand what is fixed, what is flexible, what is variable
Up-front specifications often do not solve these problems, but just mask them
Same struggles put in writing
Issues surface later, when they are more expensive to fix
Agile approaches encourage (and actually depend on) more interaction but provide little guidance for communicating about what is really crucial in a project
Matthias Book: HPC Systems Engineering in the Interaction Room 2
The Nature of Software Development
“Because software is embodied knowledge,
and that knowledge is initially dispersed, tacit, latent, and incomplete,
software development is a social learning process.”
Howard Baetjer, Jr.: Software as Capital. IEEE Computer Society Press, 1998
The Interaction Room
Successful projects require
personal, focused discussion of critical project aspects
thorough understanding of application domain, and how it is modeled in software
early recognition of value and effort drivers
early elimination of risks and uncertainties
The Interaction Room is a dedicated room for the project team
where domain and technical stakeholders feel at home
with large whiteboards on the walls
but without a classic conference table
to visualize and discuss key project aspects informally
instead of going over tedious documents
Example: IR for information system development (canvases shown: Process Canvas, Interaction Canvas, Integration Canvas, Object Canvas)
Example: Annotated Object Canvas for an Information System
A Pragmatic Approach to Conceptualizing Software
Informal, high-level sketches of software models sacrifice formality, consistency, completeness (no strict UML necessary)
in favor of focus, pragmatism, interdisciplinary understanding, value-orientation
Not a replacement for formal software specifications! These may well be necessary for certain aspects in later stages,
and can then be delegated to expert groups
Informal sketches serve as catalysts for the identification, understanding and discussion of the most critical project aspects
Interdisciplinary communication about domain and technology
High-level orientation about project goals, dependencies, conflicts, trade-offs
Early identification of value and complexity drivers, risks, uncertainties
High Performance Computing / Scientific Computing
Simulation Science
Simulation of natural processes to learn about known behavior of a complex system, e.g.
Weather forecast
Human brain
Glacial processes
Volcanic processes
Crowd behavior
etc.
(Focus of following IR ideas)
Data Science
Identification of patterns / correlations to learn about unknown aspects of a complex system, e.g.
Recognizing customer preferences
Recognizing medieval manuscript scribes
etc.
HPC: Break simulation/data science problems down into small chunks for parallel processing on a very large number of cores
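The idea of breaking a problem into chunks can be illustrated with a minimal sketch of a 1-D block decomposition. The function name `block_range` and the sizes used below are hypothetical and not taken from the slides; a real project would typically rely on an MPI library or a framework for this step.

```python
def block_range(n_cells: int, n_ranks: int, rank: int) -> range:
    """Return the contiguous index range owned by `rank`.

    The first (n_cells % n_ranks) ranks receive one extra cell, so the
    chunk sizes differ by at most one.
    """
    if n_ranks <= 0:
        raise ValueError("n_ranks must be positive")
    if not 0 <= rank < n_ranks:
        raise ValueError("rank out of bounds")
    base, extra = divmod(n_cells, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


if __name__ == "__main__":
    # Hypothetical sizes: 10 grid cells spread over 4 cores.
    for r in range(4):
        print(r, list(block_range(10, 4, r)))
```

Every cell is owned by exactly one rank, and no rank holds more than one cell above the average, which is the simplest way to avoid load imbalance when all cells cost the same.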
Crucial Interdisciplinary Communication Points in HPC Simulation Science Projects
Domain experts need to help systems engineers understand:
What research question are we trying to answer? What context, what boundary conditions?
What parameters and variables are there? How are they evolving over time? Initial values?
How do the variables affect each other? Is interaction long- or short-range?
What are particularly interesting segments of the simulation space? Are these variable?
etc.
Systems engineers need to validate technical decisions with domain experts:
Cluster architecture: Memory-intensive or compute-intensive? Many-core, multi-core, GPUs?
Domain decomposition: How to map the problem most efficiently to the cluster?
Communication patterns: Choice of communication type? Ghosts and halos?
Memory model: Distributed (MPI), shared (OpenMP) or hybrid?
Data structures: What can be transient / must be persistent? Checkpointing? Parallel I/O?
etc.
Typical Pitfalls in HPC Simulation Science Projects
Choosing appropriate solvers vs. reinventing the wheel
Inefficient domain decomposition; load imbalance
Dealing with differences between & unique strengths of individual architectures
Dealing with different schedulers and their job scripts
Debugging costs a large amount of (possibly expensive) computing time
Approximation of real world, insufficient validation data
Integrating different physical models/processes with each other (multi-physics)
Constant change of hardware, software, modus operandi
Constant need for porting; always an early adopter; changing code ownership
Many of these are revealed only in late (i.e. expensive to fix) project stages
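As one way to make the load-imbalance pitfall measurable early, the following sketch computes a simple imbalance ratio. The metric (work of the busiest rank divided by the mean work per rank) is a common convention assumed here, not something the slides prescribe, and `load_imbalance` and the sample numbers are hypothetical.

```python
def load_imbalance(work_per_rank):
    """Ratio of the busiest rank's work to the mean work per rank.

    1.0 means perfectly balanced; 2.0 means the slowest rank takes twice
    as long as it would under an ideal distribution.
    """
    if not work_per_rank:
        raise ValueError("need at least one rank")
    mean = sum(work_per_rank) / len(work_per_rank)
    if mean == 0:
        raise ValueError("total work must be non-zero")
    return max(work_per_rank) / mean


if __name__ == "__main__":
    # Hypothetical cell counts per rank after a poor decomposition.
    print(load_imbalance([100, 100, 100, 300]))
```

Because all ranks usually wait for the slowest one at each synchronization point, this ratio gives a rough estimate of how much run time is lost to waiting.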
Software Process for HPC Simulation Science Projects
1. Understand the problem domain
2. Perform appropriate domain decomposition and choose appropriate communicators, helpful libraries, data structures etc.
3. Implement correct code framework for communication between processes; integrate correct problem-domain code into communication code
4. Test and validate simulation model
5. Optimize accuracy, tune performance
Conceptual Levels in HPC Simulation Science Projects
Problem level: Statement of research question / project goal and scope
Goal, context, scope: Research question, boundary conditions, assumptions, abstractions
Quality requirements: Accuracy, generalizability, performance
Scientific level: Description of the pertinent aspects of the domain
Static aspects: Coordinates, variables, sources of influence, points of interest, physical laws
Dynamic aspects: Forces, interactions, events, timing, discontinuities
Distribution level: Breakdown of the scientific model into parallelizable units
Static aspects: Domain decomposition, data structure, initial conditions
Dynamic aspects: Communication patterns, stencils, halos, ghosts, adaptive mesh refinements, iterative numerical methods
Technical level: Implementation of distribution model on particular architecture
Static aspects: Cluster architecture, (parallel) file system, memory model, interconnect
Dynamic aspects: Communication protocols, I/O operations, available libraries, solvers
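To illustrate what a communication pattern at the distribution level looks like, here is a minimal sketch that computes the neighbours of a process in a 2-D process grid, similar in spirit to what a Cartesian communicator provides. It is plain Python rather than an MPI call; the function name `cart_neighbors`, the row-major rank numbering, and the 4 x 4 grid are assumptions made for illustration.

```python
def cart_neighbors(rank, dims, periodic=(False, False)):
    """Return the neighbouring ranks of `rank` in a 2-D process grid.

    Ranks are numbered row by row (row-major, an assumption of this
    sketch). A missing neighbour at a non-periodic boundary is None.
    """
    rows, cols = dims
    if not 0 <= rank < rows * cols:
        raise ValueError("rank out of bounds")
    row, col = divmod(rank, cols)

    def rank_at(r, c):
        # Wrap around only in the directions declared periodic.
        if periodic[0]:
            r %= rows
        if periodic[1]:
            c %= cols
        if 0 <= r < rows and 0 <= c < cols:
            return r * cols + c
        return None

    return {
        "up": rank_at(row - 1, col),
        "down": rank_at(row + 1, col),
        "left": rank_at(row, col - 1),
        "right": rank_at(row, col + 1),
    }


if __name__ == "__main__":
    # Hypothetical 4 x 4 process grid.
    print(cart_neighbors(5, (4, 4)))
    print(cart_neighbors(0, (4, 4)))
```

A five-point stencil needs exactly these four neighbours, so each process exchanges halo data only with the ranks returned here.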
Interaction Room Canvases for Simulation Science
Problem canvas: Goal and scope of research question about the domain
Real-world canvas: Description of the pertinent aspects of the domain
Decomposition canvas: Breakdown of scientific model into parallelizable units
Architecture canvas: Implementation of simulation on suitable HPC technology
Problem Canvas: Goal and scope of research question about the domain
Domain experts collect on note cards:
Research question
Boundary conditions
Assumptions
Abstractions
Quality requirements
Example: Heat dissipation problem
Question: What will the temperature in the middle of a room be like after running an air conditioner on one side and a heater on the other for several hours?
Boundary conditions: Room size, starting temperature, A/C and heater setting
Abstractions: Consider heat transfer by air flow / convection only, not by radiation
Assumptions: No moving objects in the room, no windows
Quality requirements: Temperature must be determined with double precision
Real-World Canvas: Description of the pertinent aspects of the domain
Domain experts sketch static properties of the simulation space
Spatial setup
Locations and properties of simulation elements
Domain experts sketch dynamic properties of simulation process
Forces
Events
Points of interest (actors, sensors)
Changes over time
Example: Heat dissipation problem
Room geometry, placement of heater, A/C, monitor
Working of convection forces
Working of air flows, times of A/C operation
Appropriate formulae, numerical methods
Decomposition Canvas: Breakdown of scientific model into parallelizable units
Technical experts sketch digital model reflecting real-world model, focusing on:
Static aspects: Domain decomposition, data structure, abstraction level, initial conditions
Dynamic aspects: Communication patterns, stencils, halos/ghosts, adaptive mesh refinements, iterative numerical methods
Example: Heat dissipation problem
Initial room temperature: 20°C, A/C set to 10°C
Resource requirements: e.g. regular grid, number of cores, memory usage
Adaptive mesh refinement
Cartesian communicator
Halo creation & communication strategy
Iterative method of known physical formula (heat transfer)
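The interplay of domain decomposition, halo communication and an iterative method can be sketched in a few lines. The example below is a deliberately simplified, serial stand-in: it treats the room as a 1-D row of cells, uses a Jacobi-style averaging update (steady-state diffusion rather than the convection model the example describes), and copies halo values between Python lists instead of sending MPI messages. The initial temperature of 20°C and the A/C setting of 10°C come from the example; the heater value of 30°C, the cell counts and all function names are hypothetical.

```python
def split_with_halos(values, n_parts):
    """Split a 1-D list into contiguous chunks, each padded with one ghost cell per side."""
    if not 1 <= n_parts <= len(values):
        raise ValueError("need between 1 and len(values) parts")
    base, extra = divmod(len(values), n_parts)
    chunks, start = [], 0
    for part in range(n_parts):
        stop = start + base + (1 if part < extra else 0)
        chunks.append([0.0] + values[start:stop] + [0.0])
        start = stop
    return chunks


def exchange_halos(chunks, left_bc, right_bc):
    """Fill every ghost cell from the neighbouring chunk's edge cell or from the boundary value."""
    last = len(chunks) - 1
    for part, chunk in enumerate(chunks):
        chunk[0] = left_bc if part == 0 else chunks[part - 1][-2]
        chunk[-1] = right_bc if part == last else chunks[part + 1][1]


def jacobi_sweep(chunks):
    """One Jacobi update: each owned cell becomes the mean of its two neighbours."""
    return [
        [c[0]] + [0.5 * (c[i - 1] + c[i + 1]) for i in range(1, len(c) - 1)] + [c[-1]]
        for c in chunks
    ]


def solve(n_cells, n_parts, left_bc, right_bc, initial, iterations):
    """Run the decomposed iteration and return the owned cells in global order."""
    chunks = split_with_halos([initial] * n_cells, n_parts)
    for _ in range(iterations):
        exchange_halos(chunks, left_bc, right_bc)
        chunks = jacobi_sweep(chunks)
    return [value for chunk in chunks for value in chunk[1:-1]]


if __name__ == "__main__":
    # A/C (10 degrees) on the left wall, hypothetical heater (30 degrees) on the right.
    print(solve(n_cells=12, n_parts=3, left_bc=10.0, right_bc=30.0,
                initial=20.0, iterations=500))
```

The decomposed run produces the same values as a run with a single chunk, which is a useful correctness check to agree on before moving to real message passing.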
Architecture Canvas: Implementation of simulation on suitable HPC technology
Technical experts sketch mapping of digital model to actual cluster infrastructure
Static aspects: Cluster architecture, parallel file system, memory model, available modules, tool support
Dynamic aspects: Communication protocols, I/O operations, solvers, scheduling, checkpointing, output format
Example: Heat dissipation problem
16 cores, I/O and compute nodes
MPI or hybrid, Intel compiler
Jacobi solver
10,000 iterations (based on previous experience) require ~3 h wall time
Checkpoints every 2,000 iterations, parallel I/O
Output format: List of temperature measurements
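The figures on this canvas imply a simple budget: about 1.08 s of wall time per iteration and five checkpoints over the run. The sketch below works out these numbers and shows one possible checkpoint/restart loop. It writes a single JSON file serially, which is only a stand-in for the parallel I/O named on the canvas; the file layout and all function names are hypothetical.

```python
import json
import os
import tempfile


def checkpoint_iterations(total, every):
    """Iterations at which a checkpoint is written."""
    return list(range(every, total + 1, every))


def seconds_per_iteration(wall_hours, total):
    """Average wall-clock seconds available per iteration."""
    return wall_hours * 3600 / total


def save_checkpoint(path, iteration, state):
    """Write the state to a temporary file, then rename it over the old checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump({"iteration": iteration, "state": state}, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return (iteration, state), or (0, None) if no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0, None
    with open(path) as handle:
        data = json.load(handle)
    return data["iteration"], data["state"]


def run(path, total, every, step, initial_state):
    """Iterate `step` up to `total` iterations, resuming from a checkpoint if present."""
    iteration, state = load_checkpoint(path)
    if state is None:
        state = initial_state
    while iteration < total:
        state = step(state)
        iteration += 1
        if iteration % every == 0:
            save_checkpoint(path, iteration, state)
    return state


if __name__ == "__main__":
    print(checkpoint_iterations(10000, 2000))
    print(seconds_per_iteration(3.0, 10000))
    with tempfile.TemporaryDirectory() as folder:
        ckpt = os.path.join(folder, "heat.json")
        # Toy step function standing in for one solver iteration.
        print(run(ckpt, 10, 2, lambda s: [v + 1.0 for v in s], [0.0]))
```

With checkpoints every 2,000 iterations, a job that is killed loses at most 2,000 iterations, roughly 36 minutes of the estimated wall time.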
Interaction Room Annotations for Simulation Science
Highlight model elements that merit particular consideration:
Value annotations
Scientific value
Risk annotations
Complexity
Innovation
Uncertainty
Effort annotations
Quality requirements
Boundary conditions
Interfaces
Shift attention from what is visible in models to what is implied, what is assumed, what is unknown
i.e. those aspects that often make or break a project
Interaction Room Workflow
1. Define project scope on problem canvas
2. For real-world, decomposition and architecture canvas, iteratively:
   1. Sketch static & dynamic canvas concepts
   2. Place and discuss canvas annotations
   3. Place Uncertainty annotations
   4. Refine prior canvases with new insights
3. Identify need for formal specifications
4. Make project plan, e.g. agile backlog or classic work packages
Outlook on Further Work
More precise definition of canvases and notations
Choice of appropriate annotations – identification of particular challenges
Refinement of process
Validation in actual projects