Member of the Helmholtz Association
Job-based I/O-monitoring with LLview
Wolfgang Frings, Jülich Supercomputing Centre
Analyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017
I/O Monitoring with LLview & GPFS
I/O Workloads on JURECA
Analyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017 LLview GPFS I/O Monitoring & Statistics, W.Frings 2
I/O Usage Monitoring and I/O Statistics
Levels, from coarse grained to fine grained:

Disk usage
  Metrics: file system usage, quota (q_dataquota)
  Source: GPFS quotas, update freq.: 1/4h (disk), 1/month (tape)
  Use: FS planning (#disks, file system types, …)

File statistics
  Metrics: file sizes / file types / directory stats
  Source: GPFS policy engine, update freq.: 1/month
  Use: FS parameter planning (blocksize, ...)

Node-level I/O monitoring
  Metrics: bytes written/read, file open/close
  Source: GPFS mmpmon, update freq.: 1/min
  Use: B/W planning, I/O traffic / congestion monitoring, application I/O monitoring (coarse)

Application I/O analysis
  Metrics: appl. I/O summary, I/O calls, parallelism
  Source: darshan (lib., wrapper), score-p (instr.)
  Use: application I/O monitoring (fine grained)

Network activity monitoring
Application activity & I/O
SIOX
GPFS mmpmon Data Workflow in LLview
Compute nodes (GPFS clients): each node writes mmpmon logfiles, 1/min
  timestamp1 <fs> <bytes_wr> <bytes_rd> <op_open> <op_close> …
  timestamp2 <fs> …
LLview adapter for GPFS: mmpmon metrics
Scheduler (Slurm): node / job info
LLview adapter for scheduler: LML, … 0.5-1/min
LLview workflow: LML, LML_usage module, LLview usage SQL database (sqlite3), LLview clients
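A minimal sketch of the first workflow step, assuming each logfile record is a whitespace-separated line in the field order shown on the slide (timestamp, fs, bytes_wr, bytes_rd, op_open, op_close, possibly followed by further fields). The table name io_usage, the node name node0001 and the sample values are hypothetical; this is not the actual LLview adapter or database schema.

```python
import sqlite3
from typing import Iterable, NamedTuple


class IoSample(NamedTuple):
    timestamp: int
    node: str
    fs: str
    bytes_wr: int
    bytes_rd: int
    op_open: int
    op_close: int


def parse_line(node: str, line: str) -> IoSample:
    """Parse one '<timestamp> <fs> <bytes_wr> <bytes_rd> <op_open> <op_close> ...' record.

    Fields after the sixth are ignored (assumption: they are not needed here).
    """
    ts, fs, wr, rd, op_open, op_close = line.split()[:6]
    return IoSample(int(ts), node, fs, int(wr), int(rd), int(op_open), int(op_close))


def store(conn: sqlite3.Connection, samples: Iterable[IoSample]) -> None:
    """Append samples to a (hypothetical) usage table in an sqlite3 database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS io_usage ("
        "timestamp INTEGER, node TEXT, fs TEXT, bytes_wr INTEGER, "
        "bytes_rd INTEGER, op_open INTEGER, op_close INTEGER)"
    )
    conn.executemany("INSERT INTO io_usage VALUES (?, ?, ?, ?, ?, ?, ?)", samples)
    conn.commit()


# Made-up records, one per minute, for a single node.
log_lines = [
    "1510840800 work 1000 500 3 2",
    "1510840860 work 4000 900 5 4",
]
conn = sqlite3.connect(":memory:")
store(conn, [parse_line("node0001", line) for line in log_lines])
latest_wr = conn.execute(
    "SELECT MAX(bytes_wr) FROM io_usage WHERE node = 'node0001'"
).fetchone()[0]
```

Storing the raw per-node samples and deriving job-level values later keeps the adapter simple; whether LLview does it this way is not stated on the slide.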
LLview: User-level Monitoring
Efficient supervision of node usage, running jobs, statistics, history
Monitoring of energy consumption, load, memory usage, I/O usage, GPU usage
Interactive and mouse-sensitive
Main data source: batch scheduler, runtime system
Web-based HTML5 client (in preparation)
Fully customizable, fast and portable client-server application
Integrated into Eclipse/PTP
Support for various resource managers, incl. LoadLeveler, IBM Blue Gene, Cray ALPS, PBSpro, Torque, SLURM, Grid Engine and LSF
LLview download (open source): http://www.fz-juelich.de/jsc/llview
LLview: Node & File System Metrics
Job-based metrics:
  bandwidth (last minute)
  bytes written/read since job start
  open/close operations
Node-based mapping of I/O load (color-coded)
Job-based history (I/O, Load, Memory, …)
System-based history (I/O, Load, Memory, …)
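The job-based metrics listed above can be derived from per-node samples once the scheduler supplies the node list and start time of a job. The sketch below assumes the per-node byte counters are cumulative and are not reset while the job runs; the function name, the data layout and the sample values are hypothetical.

```python
from typing import Dict, List, Tuple

# (timestamp in seconds, cumulative bytes written, cumulative bytes read)
Sample = Tuple[int, int, int]


def job_io_metrics(
    samples_by_node: Dict[str, List[Sample]],
    job_nodes: List[str],
    job_start: int,
) -> Dict[str, float]:
    """Sum per-node counters into job totals and a last-interval bandwidth."""
    written = read = 0
    bw_write = bw_read = 0.0
    for node in job_nodes:
        # Keep only samples taken at or after job start, in time order.
        s = sorted(x for x in samples_by_node.get(node, []) if x[0] >= job_start)
        if len(s) < 2:
            continue  # not enough data for this node yet
        first, prev, last = s[0], s[-2], s[-1]
        written += last[1] - first[1]
        read += last[2] - first[2]
        dt = last[0] - prev[0]
        if dt > 0:
            # Bandwidth over the most recent sampling interval (about one minute).
            bw_write += (last[1] - prev[1]) / dt
            bw_read += (last[2] - prev[2]) / dt
    return {
        "bytes_written": written,
        "bytes_read": read,
        "bw_write": bw_write,
        "bw_read": bw_read,
    }


# Made-up samples for a two-node job that started at t = 0.
samples = {
    "node_a": [(0, 0, 0), (60, 600, 60), (120, 1800, 120)],
    "node_b": [(0, 0, 0), (60, 0, 0), (120, 1200, 0)],
}
metrics = job_io_metrics(samples, ["node_a", "node_b"], job_start=0)
```

Because mmpmon reports per node, this mapping is only exact when a node is not shared between jobs; that is an assumption of the sketch, not a statement from the slides.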
GPFS mmpmon Data Workflow in LLview
Project / User database (Oracle SQL)
On-/off-line analysis
LLview job reports, <1/min
PDFs, HTML5
LLview: Job reports
Detailed reports on active and finished jobs
Job summary and time-based diagrams
Use cases: development, production checks, job health check, light-weight performance analysis
Metrics per Node/GPU:
  CPU load (5-min avg.)
  Interconnect (MB/s in/out, packets in/out)
  Memory usage (GiB)
  I/O (per FS: GiB, GiB/s write/read, #Open/Close)
  GPU (Usage, Memory, Power, Temperature, …)
Data Sources: SLURM, GPFS, IB, GPU
Lists for active and finished jobs (< 2 weeks)
Updates after 1-5 minutes
Colormap for quick evaluation of metrics
Link to PDF: detailed job report
GPFS mmpmon Data Workflow in LLview
LLview job reports and statistics, <1/min
I/O Workload: JURECA
Plot axes: total # bytes read, total # bytes written
Labeled regions: read ≈ write, write >> read, read >> write
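The three regions of the workload plot suggest a simple per-job classification by total bytes read and written. The sketch below is one possible rule, not the one used for the plot: the threshold factor of 10 and the extra "no I/O" class are assumptions introduced here for illustration.

```python
def classify_io(bytes_read: int, bytes_written: int, factor: float = 10.0) -> str:
    """Assign a job to one of the plot regions by its read/write totals.

    `factor` (assumed, not from the slides) is how many times larger one
    direction must be before it counts as dominant.
    """
    if bytes_read == 0 and bytes_written == 0:
        return "no I/O"  # hypothetical extra class for jobs without any traffic
    if bytes_written > factor * bytes_read:
        return "write >> read"
    if bytes_read > factor * bytes_written:
        return "read >> write"
    return "read ≈ write"


# Made-up job totals in bytes: (read, written).
jobs = {"job1": (100, 100), "job2": (1, 1000), "job3": (1000, 0)}
classes = {name: classify_io(rd, wr) for name, (rd, wr) in jobs.items()}
```

With a factor of 10, any job whose totals differ by less than one order of magnitude falls into the balanced class; a stricter factor would narrow that band.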
I/O Workload: JURECA by Topic (WORK)
Particle Physics
Climate, Earth, Environment
CFD
Plasma Physics
Misc.
Engineering
Computational Biology, Biophysics, Biochemistry
Material Science
Astrophysics
Polymer/Soft Matter Physics
Condensed Matter Physics
Chemistry
Nuclear/Atomic Physics
…
I/O Workload: JURECA by Group
Climate, Earth, Environment
group 1, group 2, group 3, group 4, group 5, group 6, group 7, group 8, group 9, group 10, group 11
…
Conclusion & Outlook
I/O workload statistics and analysis on different levels
  Concepts for analysis of job-based mmpmon data; on-line and off-line analysis
  Classification (e.g. research topic, group, account)
User access to I/O statistics
  I/O reports access within user portal
  Automatic job reports after job end
Next steps
  Automatic identification of poorly behaving jobs
  Additional (derived) metrics
    Temporal locality: continuous, burst, …
    Spatial locality: parallel I/O schemes
  Web interface: statistics, interactive graphs