Astrophysics, Biology, Climate, Combustion, Fusion, HEP, Nanoscience
Sim Scientist
DOE NL
5/24/2004 Chicago Meeting DOE Data Management 2
Workflows
• Critical need: enable (and automate) scientific workflows
– Data Generation
– Data Storage
– Data Transfer
– Data Analysis
– Visualization
• An order of magnitude more effort can be spent on manually managing these workflows than on performing the simulation itself.
• Workflows are not static.
Simulations
• Simulations run in batch mode.
• The remaining workflow is interactive or “on demand.”
• Simulation and analyses are performed by distributed teams of research scientists.
– Need to access remote and distributed data and resources.
– Need for distributed collaborative environments.
• We will not present solutions in this talk!
– Some solutions will be problem dependent.
• Example: remote viz. vs. local viz., Parallel HDF5 vs. Parallel netCDF, …
How do we do simulation science (I)
• Let’s suppose that we have a verified HPC code.
– I will use the Gyrokinetic Toroidal Code (GTC) as an example.
• We also suppose that we have a suite of analysis and visualization programs.
• We eventually want to compare the output to theoretical, experimental, and/or other simulation results.
A fast peek at the workflow
[Workflow diagram: thought → HPC simulation → analysis steps (compute volume averages; tracer particle energy, position, and momentum; 1D and 2D radial and velocity profiles; correlation functions; feature tracking of the heat potential) → VIZ at each stage → global analysis tools → outputs: TB’s of data, viz, features, metadata, movies, paper. Workflow categories: data generation, data transfer, data storage, data analysis, data visualization.]
Let’s go through the scientific process.
• Requirements: 1 TB/sim (10 TB/year) now; 100 TB/sim (0.5 PB/year) in 5 years; 58 Mb/s now, 1.6 Gb/s in 5 years.
Stage 1: Initial Question + Thought
• The scientist thinks of a problem to answer a physical question.
• Example:
– What saturates transport driven by the Ion Temperature Gradient?
• Requirements:
– Possible changes in the code.
– New visualization routines to examine particles.
– New modifications in analysis tools.
[Timeline diagram: question → thought. Collaborate with O(5) people: face to face, phone.]
Stage 2: Change code, add analysis
• If
– Code is mature, go to Stage 4.
• Else
– Scientists modify the HPC code to put in new routines for new physics and new capabilities.
– Scientists change the code to answer the question.
– If necessary, analysis/viz routines are added or modified.
– Where do the inputs come from?
• Experiments, other simulations, theory.
[Timeline diagram: thought → question → code modifications → HPC runs over weeks; O(5) people modify the code; each run alternates code input, computation, and I/O per time step (1 TS). Total output = 1 TB/full run over 40 hours = 58 Mb/s now; in 5 years, 0.1 PB/hero run over 150 hours = 1.6 Gb/s.]
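The quoted rates follow directly from the output sizes and run times; a quick check in decimal units (so the figures land slightly below the slide's rounded 58 Mb/s):

```python
def sustained_rate_mbps(bytes_total, hours):
    """Average output bandwidth in megabits per second (decimal units)."""
    bits = bytes_total * 8
    seconds = hours * 3600
    return bits / seconds / 1e6

# 1 TB over a 40-hour run (today's requirement)
now = sustained_rate_mbps(1e12, 40)        # ~55.6 Mb/s, quoted as 58 Mb/s
# 0.1 PB over a 150-hour hero run (5-year projection)
future = sustained_rate_mbps(1e14, 150)    # ~1480 Mb/s, quoted as 1.6 Gb/s
print(round(now, 1), round(future))
```

The small gap between 55.6 and the quoted 58 Mb/s presumably comes from rounding or binary (TiB) units.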
Stage 3: Debugging stage
• Scientists modify the HPC code to put in new routines for new physics.
• Scientists generally run a parameter survey to answer the question(s).
• Scientists change the code to answer the question.
• 1 to 2 people debug the code.
• Verify the code again; regression test.
[Timeline diagram: question → code modifications → HPC debug runs over weeks → compute volume average → VIZ → thought → continue run sequence. Total output = 0.1 Mb/s; results are thrown away.]
Stage 4: Run production code.
• Now the scientist has confidence in the modifications.
• Scientists generally run a parameter survey and/or sensitivity analysis to answer the question(s).
• Scientists need good analysis and visualization routines.
• O(3) people look at the raw data and run analysis programs.
– Filter data.
– Look for features for the larger group.
• O(10) people look at the end viz. and interpret the results.
[Timeline diagram: question → production run (1000 TS) → interpret results → thought. Data streams: scalar data 60 Mb/s, particles 50 Mb/s, 0.01 Mb/s monitoring; 0.5% of time steps saved; data can flow from RAM to RAM/disk/WAN/LAN.]
Stage 4a: Data management observations.
• We must understand
1. Data generation from the simulation and analysis routines.
2. Size of the data being generated.
– Latency issues for access patterns.
– Can we develop good compression techniques?
– Bandwidth/disk speed issues.
– Do we need non-volatile storage? RAM-RAM, RAM-disk-tape.
– “Plug and play” analysis routines need a common data model.
– It is non-trivial to transfer from N processors to M processors!
– Bottleneck: analysis is too slow.
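A "plug and play" common data model essentially means a self-describing container: arrays paired with names, units, and metadata, so an analysis routine is written against the model rather than against any one code's file layout. A minimal hypothetical sketch (all class and field names are invented for illustration, not a proposed standard):

```python
import json
import numpy as np

class Field:
    """One named, unit-tagged array in the common data model."""
    def __init__(self, name, data, units, coords=None):
        self.name, self.data, self.units = name, np.asarray(data), units
        self.coords = coords or {}          # e.g. {"r": radial grid}

class Dataset:
    """Self-describing container that any analysis routine can consume."""
    def __init__(self, code_name, metadata=None):
        self.code_name = code_name          # e.g. "GTC"
        self.metadata = metadata or {}      # provenance, parameters, ...
        self.fields = {}

    def add(self, field):
        self.fields[field.name] = field

    def describe(self):
        """Machine-readable summary, suitable for a metadata catalog."""
        return json.dumps({
            "code": self.code_name,
            "fields": {n: {"shape": list(f.data.shape), "units": f.units}
                       for n, f in self.fields.items()},
        })

# An analysis routine written against the model, not against one code's files:
def volume_average(dataset, name):
    return float(dataset.fields[name].data.mean())

ds = Dataset("GTC", {"version": "demo"})
ds.add(Field("potential", [[1.0, 2.0], [3.0, 4.0]], units="arbitrary"))
print(volume_average(ds, "potential"))   # 2.5
```

The same `volume_average` then works unchanged on output from any code that can populate a `Dataset`, which is the "multiple codes, multiple disciplines" reuse the talk asks about.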
[Timeline diagram repeating the Stage 4 picture: code modifications → HPC → VIZ; particles 50 Mb/s, scalar 60 Mb/s, 0.01 Mb/s monitoring; 0.5% of time steps, 1000 TS. Save scalar data for more post-processing; save viz data; toss particle data.]
Stage 5: Feedback stage
• After the production run we interpret the results.
• We then ask a series of questions:
– Do I have adequate analysis routines?
– Was the original hypothesis correct?
– Should the model equations change?
– Do we need to modify it?
• If everything is OK, should we continue the parameter survey?
[Timeline diagram: production run → interpret results → thought → HPC; compute correlation functions; VIZ; comparison to other data, theory, simulations, experiments. The workflow is changing!]
Stage 5: Observations
• To expedite this process:
– Need standard data model(s).
– Can we build analysis routines which can be used for multiple codes and/or multiple disciplines?
• The data model must allow flexibility.
– Commonly we add/remove variables used in the simulation/analysis routines.
– Need for metadata, annotation, and provenance:
• Nature of metadata:
– Code versions, compiler information, machine configuration.
– Simulation parameters, model parameters.
– Information on simulation inputs.
– Need for tools to record provenance in databases.
• Additional provenance (beyond that provided by the metadata above) is needed to describe:
– Reliability of the data; how the data arrived in the form in which it was accessed; data ownership.
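Concretely, the metadata the slide enumerates fits naturally into a single structured record written alongside each run. A sketch of what such a record might contain (field names are illustrative, not a proposed schema):

```python
import json
from datetime import datetime, timezone

def make_provenance_record(code, version, machine, parameters, inputs, owner):
    """Assemble the metadata the slide calls for into one JSON-serializable
    record. All field names are illustrative, not a standard."""
    return {
        "code": code,                      # e.g. "GTC"
        "code_version": version,           # code version / compiler information
        "machine": machine,                # machine configuration
        "parameters": parameters,          # simulation and model parameters
        "inputs": inputs,                  # information on simulation inputs
        "owner": owner,                    # data ownership
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

record = make_provenance_record(
    code="GTC", version="r1234 / ifort -O3", machine="example-hpc-system",
    parameters={"grid": [64, 64, 32], "timesteps": 1000},
    inputs=["experimental equilibrium profile"], owner="sim-team",
)
print(json.dumps(record, indent=2))
```

Because the record is plain JSON, it can be loaded into the provenance databases the slide mentions without a format converter per code.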
Stage 5: Observations
• Data analysis routines can include:
– Data transformation
• Format transformation
• Reduction
• Coordinate transformation
• Unit transformation
• Creation of derived data
• …
– Feature detection, extraction, tracking
• Define metadata
• Find regions of interest
• Perform level-set analyses in spacetime
• Perform born analyses
– Inverse feature tracking
– Statistical analysis: PCA, comparative component analyses, data fitting, correlations
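As one concrete instance of the statistical tools listed above, PCA on simulation output reduces to an SVD of the centered data matrix. A generic sketch (not the speakers' actual routine; the synthetic "diagnostic" data is invented for the demo):

```python
import numpy as np

def pca(samples, n_components):
    """Principal component analysis via SVD of the centered data matrix.
    samples: (n_observations, n_variables), e.g. time slices x diagnostics."""
    centered = samples - samples.mean(axis=0)
    # SVD: the rows of vt are the principal directions
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    scores = centered @ components.T          # data projected onto components
    explained = (s**2) / (s**2).sum()         # variance fraction per component
    return components, scores, explained[:n_components]

rng = np.random.default_rng(0)
# Synthetic data: two perfectly correlated signals plus a small noise channel
t = rng.normal(size=(200, 1))
data = np.hstack([t, 2 * t, rng.normal(scale=0.1, size=(200, 1))])
comps, scores, frac = pca(data, 1)
print(f"first component explains {frac[0]:.1%} of the variance")
```

Here the first component should capture nearly all the variance, since two of the three channels are redundant; that redundancy detection is exactly the "needle in the haystack" reduction the talk wants from analysis routines.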
Stage 5: Observations
• Visualization needs
– Local, remote, interactive, collaborative, quantitative, comparative
– Platforms
– Fusion of different data types
• Experimental, theoretical, computational, …
• New representations
Stage 6: Complete parameter survey
• Complete all of the runs for the parameter survey to answer the question.
• 1-3 people are looking at the results during the parameter survey.
[Timeline diagram: repeated production run → interpret results cycles; HPC → feature tracking → VIZ.]
Stage 7: Run a “large” hero run
• Now we can run a high-resolution case, which will run for a very long time.
• O(10) people are looking at the results.
Stage 8: Assimilate the results.
• Did I answer the question?
– Yes:
• Now publish a paper.
• O(10+) people look at the results.
• Compare to experiment.
– Details here.
• What do we need stored?
– Short-term storage
– Long-term storage
– No:
• Go back to Stage 1: Question
[Diagram: interpret results → assimilate results; TB’s of viz, features, metadata, and movies flow into a data repository served by global analysis tools, data mining tools, and VIZ.]
Stage 9: Other scientists use the information
• Now other scientists can look at this information and use it for their analysis, or as input for their simulations.
• What are the data access patterns?
– Global interactive VIZ: GB’s of data per time slice, TB’s in the future.
– Bulk data is accessed numerous times.
– Look at derived quantities: MB’s to GB’s of data.
• How long do we keep the data?
– Generally less than 5 years.
Let Thought be the bottleneck
• Simulation scientists generally have scripts to semi-automate parts of the workflow.
• To expedite this process they need to:
– Automate the workflow as much as possible.
– Remove the bottlenecks:
• Better visualization and better data analysis routines will allow users to decrease the interpretation time.
• Better routines to “find the needle in the haystack” (feature detection/tracking) will allow the thought process to be shortened.
• Faster turnaround time for simulations will decrease the code runtimes.
– Better numerical algorithms, more scalable algorithms.
– Faster processors, faster networking, faster I/O.
– More HPC systems, more end stations.
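The semi-automation described above is typically a driver script that chains the pipeline stages and stops only where human judgment is needed. A hypothetical sketch (stage names and commands are placeholders, not the actual GTC tooling):

```python
import subprocess

# Placeholder commands for each pipeline stage; a real driver would invoke
# the batch scheduler, data transfer tools, and analysis/viz executables.
STAGES = [
    ("simulate",  ["echo", "run GTC in batch"]),
    ("transfer",  ["echo", "move output to analysis cluster"]),
    ("analyze",   ["echo", "compute profiles and correlations"]),
    ("visualize", ["echo", "render movies"]),
]

def run_pipeline(stages):
    """Run each stage in order; stop at the first failure so a human can
    intervene -- keeping 'thought' as the only manual step."""
    completed = []
    for name, cmd in stages:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"stage {name!r} failed; pausing for human inspection")
            break
        completed.append(name)
    return completed

print(run_pipeline(STAGES))
```

Even this trivial structure removes the "babysitting" the summary complains about: the scientist is interrupted only on failure or at the interpretation step, not between every stage.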
Summary:
• Biggest bottleneck: interpretation of results.
– This is the biggest bottleneck because of:
• Babysitting
– Scientists spend their “real time” babysitting computational experiments (trying to interpret results, move data, and orchestrate the computational pipeline).
– Deciding if the analysis routines are working properly with this “new” data.
• Non-scalable data analysis routines
– Looking for the “needle in the haystack.”
– Better analysis routines could mean less time in the thought process and in the interpretation of the results.
• The entire scientific process cannot be fully automated.
Workflows
• No changes in these workflows.
Section 3: Astrophysical Simulation Workflow Cycle
[Workflow diagram, three layers:
• Application layer: run the simulation as a batch job on the capability system → the simulation generates checkpoint files → archive checkpoint files to HPSS → migrate a subset of checkpoint files to the local cluster → vis & analysis on a local Beowulf cluster → continue simulation? start new simulation?
• Parallel I/O layer: Parallel HDF5.
• Storage layer: HPSS; MSS, disks, & OS; GPFS, PVFS, or LUSTRE.]
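The archive-and-migrate step in this cycle amounts to: keep every checkpoint in the archive, but stage only a subset to the analysis cluster. A toy sketch of that selection logic (the paths, file pattern, and keep-every-Nth policy are invented for illustration; the real system uses HPSS and mass-storage tools):

```python
import shutil
import tempfile
from pathlib import Path

def stage_checkpoints(archive_dir, analysis_dir, every_nth=4):
    """Copy every Nth checkpoint file from the archive area to the analysis
    cluster's disk, mimicking the 'migrate subset of checkpoint files' step."""
    analysis_dir.mkdir(parents=True, exist_ok=True)
    checkpoints = sorted(archive_dir.glob("chk_*.h5"))
    staged = checkpoints[::every_nth]          # the subset worth analyzing
    for f in staged:
        shutil.copy(f, analysis_dir / f.name)
    return [f.name for f in staged]

# Demo with throwaway files standing in for real checkpoints
tmp = Path(tempfile.mkdtemp())
archive, local = tmp / "archive", tmp / "cluster"
archive.mkdir()
for i in range(10):
    (archive / f"chk_{i:04d}.h5").write_text("fake checkpoint")
print(stage_checkpoints(archive, local))   # ['chk_0000.h5', 'chk_0004.h5', 'chk_0008.h5']
```

The point of the policy is the data-size asymmetry noted throughout the talk: the archive holds TB's, while the local vis cluster only needs the slices a human will actually look at.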
Biomolecular Simulation
[Workflow diagram: design molecular system → molecular system construction (from a structure database, e.g. PDB) → parameterization → computer simulation → molecular trajectories (raw data on large-scale temporary storage) → archive trajectories to a trajectory database server (e.g. BioSimGrid) → review/curation → statistical analysis → analysis & visualization. Underpinned by hardware, OS, math libraries, MSS (HPSS), and storage management, data movement, and access.]
Combustion Workflow
GTC Workflow
[Diagram: GTC time-step loop — deposit the charge of every particle on the grid; solve the Poisson equation to get the potential on the grid; calculate the electric field; gather the forces from the grid to the particles and push them; do process migration for the particles that have moved out of their current domain. GTC output feeds analysis — compute volume-averaged quantities; tracer particle energy, position, and momentum; 1D and 2D radial and velocity profiles; correlation functions — each followed by viz.]
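The GTC time-step loop enumerated above is the classic particle-in-cell (PIC) cycle. A minimal 1D electrostatic PIC step shows the same deposit / solve / gather / push structure (this is a generic textbook sketch with nearest-grid-point weighting, unit constants, and a periodic domain standing in for process migration — not the gyrokinetic algorithm or GTC's actual code):

```python
import numpy as np

def pic_step(x, v, grid_n, length, dt, charge=1.0):
    """One simplified 1D electrostatic PIC step mirroring the GTC loop."""
    dx = length / grid_n
    # 1. Deposit the charge of every particle on the grid (NGP weighting)
    cells = np.floor(x / dx).astype(int) % grid_n
    rho = np.bincount(cells, minlength=grid_n) * charge / dx
    # 2. Solve the Poisson equation for the potential (spectral, periodic;
    #    a neutralizing background is subtracted so the k=0 mode vanishes)
    k = 2 * np.pi * np.fft.fftfreq(grid_n, d=dx)
    rho_k = np.fft.fft(rho - rho.mean())
    phi_k = np.zeros_like(rho_k)
    phi_k[1:] = rho_k[1:] / k[1:]**2
    phi = np.real(np.fft.ifft(phi_k))
    # 3. Calculate the electric field E = -dphi/dx (centered difference)
    efield = -(np.roll(phi, -1) - np.roll(phi, 1)) / (2 * dx)
    # 4. Gather forces from the grid to the particles and push them
    v = v + efield[cells] * dt
    # 5. "Migration": periodic wrap stands in for moving particles between domains
    x = (x + v * dt) % length
    return x, v, phi

rng = np.random.default_rng(1)
x = rng.uniform(0, 1.0, size=1000)
v = np.zeros(1000)
x, v, phi = pic_step(x, v, grid_n=32, length=1.0, dt=0.01)
```

In a parallel code, step 5 is where the data-management cost appears: particles crossing domain boundaries must be exchanged between processors every step, which is the "process migration" box in the diagram.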
NIMROD Workflow
[Diagram: nimset + input files (fluxgrid.in, nimhdf.in, nimfl.in, …) → dump.00000 → NIMROD (nimrod.in, run-time config) → dump.* restart files (~100 files), plus discharge, energy, and nimhist data for every time step → post-processors (nimhdf, nimfl, nimplot, …, with their own run-time config) → Phi.h5, nimfl.bin → visualization (Xdraw, AVS/Express, SCIRun, OpenDX) → animations, images, screen.]
M3D Simulation Studies 2009 (rough estimate)
[Diagram: initial run from equilibrium codes (VMEC, JSOLVER, EFIT, etc.) → restart 1 → restart 2 → … → restart N → done. Run M3D at NERSC on 10,000 processors for 20 hours per segment; archive to HPSS (NERSC); transfer 1 TB files to PPPL local project disks in 10 min (if parallel?); post-process locally on the upgraded PPPL cluster, requiring 10 min per time slice to analyze, typically 20 time slices.]
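The "1 TB in 10 minutes" target implies a sustained rate of roughly 13 Gb/s, which is presumably why the slide questions whether a parallel transfer would be required; the arithmetic:

```python
def required_gbps(bytes_total, minutes):
    """Sustained network rate (gigabits per second) to move the data in time."""
    return bytes_total * 8 / (minutes * 60) / 1e9

rate = required_gbps(1e12, 10)
print(f"{rate:.1f} Gb/s")   # 13.3 Gb/s
```

Compare this with the 1.6 Gb/s five-year requirement quoted earlier in the talk: a single stream at that rate would take over an hour per file.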
A Simplified VORPAL Workflow
[Diagram: initial parameters + input data → VORPAL (run-time configurations) → data sets D1, D2, D3, …, Dn → data filtering/extraction → filtered D1…Dn → image generator (Xdraw) → time slices png1, png2, png3, …, pngn → animations for Sim1, Sim2, …, SimX.]
Currently, the workflow is handled by a set of scripts. Data movement is handled either by scripts or manually.
TRANSP Workflow
[Diagram: diagnostic hardware at experiments (C-Mod, DIII-D, JET, MAST, NSTX) → preliminary data analysis and preparation (largely automated): 20-50 signals {f(t), f(x,t)} — plasma position, shape, temperatures, densities, field, current, RF and beam injected powers → TRANSP analysis: current diffusion, MHD equilibrium, fast ions, thermal plasma heating; power, particle, and momentum balance → experiment simulation output database, ~1000-2000 signals {f(t), f(x,t)} → visualization; load relational databases; detailed (3D) time-slice physics simulations (GS2, ORBIT, M3D, …). Pre- and post-processing at the experimental site. — D. McCune, 23 Apr 2004]
Workflow for Pellet Injection Simulations
[Diagram: input files → preliminary analysis (deciding run parameters) → run 1D pellet code → table of the energy sink term as a function of flux surface and time → run AMR production code (the majority of the time) → HDF5 data files → run post-processing code to compute visualization variables and other diagnostic quantities (e.g. total energy) for plotting. Outputs: HDF5 files of plotting variables, visualized in computational space using ChomboVis; ASCII files of diagnostic variables, used to create diagnostic plots; interpolated binary data files (solution on the finest mesh), used to visualize field quantities in a torus with AVS or EnSight.]
Degas2 Workflow
High-Energy Physics Workflow (typical of a major collaboration)
[Diagram: data acquisition (DAQ team, 1 site) and simulation (simulation team, 10s of sites) → reconstruction / feature extraction (reconstruction team, few sites) → skimming/filtering (skim team, few sites) → analysis (all physicists, 100+ sites). Databases (< 1 terabyte): conditions, metadata, and workflow. Data volumes: 100s of terabytes today, 10s of petabytes in 2010.]
Nuclear Physics Workflow (typical of a major collaboration)
[Diagram: identical in structure to the high-energy physics workflow — DAQ team at 1 site and simulation team at 10s of sites → reconstruction / feature extraction at a few sites → skimming/filtering at a few sites → analysis by all physicists at 100+ sites; databases (< 1 terabyte) for conditions, metadata, and workflow; 100s of terabytes today, 10s of petabytes in 2010.]
Comments from others