12
Mitglied der Helmholtz-Gemeinschaft Job-based I/O-monitoring with LLview Wolfgang Frings [email protected] Jülich Supercomputing Centre Analyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017 I/O Monitoring with LLview & GPFS I/O Workloads on JURECA

Job-based I/O-monitoring with LLviewAnalyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017 LLview GPFS I/O Monitoring & Statistics, W.Frings 12 Conclusion & Outlook I/O Workload

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Job-based I/O-monitoring with LLviewAnalyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017 LLview GPFS I/O Monitoring & Statistics, W.Frings 12 Conclusion & Outlook I/O Workload

Mitg

lied

der H

elm

holtz

-Gem

eins

chaf

t

Job-based I/O-monitoring with LLview

Wolfgang [email protected]ülich Supercomputing Centre

Analyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017

I/O Monitoring with LLview & GPFS

I/O Workloads on JURECA

Page 2: Job-based I/O-monitoring with LLviewAnalyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017 LLview GPFS I/O Monitoring & Statistics, W.Frings 12 Conclusion & Outlook I/O Workload

Analyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017 LLview GPFS I/O Monitoring & Statistics, W.Frings 2

I/O Usage Monitoring and I/O Statistics co

arse

gra

ined

fine

grai

ned Network activity

monitoring

File statistics file sizes / file types /directory stats

GPFS policy engineupdate freq.: 1/month

FS parameter planning (blocksize, ...)

Node-level I/O monitoring

bytes written/read, file open/close

GPFS mmpmonupdate freq.: 1/min

B/W planning, I/O traffic/ congestion monitoringApplication I/O monitoring (coarse)

Application I/O analysis

appl. I/O summaryI/O-calls, parallelism

darshan (lib.,wrapper)

score-p (instr.)

Application I/O monitoring (fine grained)Application activity & I/O

file system usage: quota (q_dataquota)

GPFS quotasupdate freq.: 1/4h (disk)

1/month (tape)

FS Planning (#disks, file systems types, …)

Disk usage

SIOX

Page 3: Job-based I/O-monitoring with LLviewAnalyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017 LLview GPFS I/O Monitoring & Statistics, W.Frings 12 Conclusion & Outlook I/O Workload

Analyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017 LLview GPFS I/O Monitoring & Statistics, W.Frings 3

GPFS mmpmon Data Workflow in LLview

compute nodeGPFS Client

compute nodeGPFS Client

compute nodeGPFS Client

mmpmonlogfiles

timestamp1 <fs> <bytes_wr> \<bytes_rd> <op_open> <op_close> …

timestamp2 <fs> ……

1/min

1/min

1/min

LLview: adapterfor GPFS

mmpmon metrics

Scheduler (Slurm)

node / job info

LLview: adapterfor scheduler LML

…0.5-1/min

LLviewworkflow

LML LLview usage SQL database(sqlite3)

LML_usage moduleLLviewclientsLML

Page 4: Job-based I/O-monitoring with LLviewAnalyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017 LLview GPFS I/O Monitoring & Statistics, W.Frings 12 Conclusion & Outlook I/O Workload

Analyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017 LLview GPFS I/O Monitoring & Statistics, W.Frings 4

LLview: User-level Monitoring Efficient supervision of node usage,

running jobs, statistics, history Monitoring of energy consumption,

load, memory usage, I/O usage, GPU usage

Interactive and mouse-sensitive Main data source: batch scheduler,

runtime system Web-based HTML5 client (in preparation)

Fully customizable, fast and portable client-server application

Integrated into Eclipse/PTP Support for various resource

manager, incl. LoadLeveler, IBM Blue Gene, Cray ALPS, PBSpro, Torque, SLURM, Grid Engine and LSF LLview download: (open source)

http://www.fz-juelich.de/jsc/llview

Page 5: Job-based I/O-monitoring with LLviewAnalyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017 LLview GPFS I/O Monitoring & Statistics, W.Frings 12 Conclusion & Outlook I/O Workload

Analyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017 LLview GPFS I/O Monitoring & Statistics, W.Frings 5

LLview: Node & File System Metrics

Job-based metrics: bandwidth (last minute) bytes written/read since job start open/close operations

Node-based mapping ofI/O load (color-coded)

Job-based history (I/O, Load, Memory, …)

System-based history (I/O, Load, Memory, …)

Page 6: Job-based I/O-monitoring with LLviewAnalyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017 LLview GPFS I/O Monitoring & Statistics, W.Frings 12 Conclusion & Outlook I/O Workload

Analyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017 LLview GPFS I/O Monitoring & Statistics, W.Frings 6

GPFS mmpmon Data Workflow in LLview

compute nodeGPFS Client

compute nodeGPFS Client

compute nodeGPFS Client

mmpmonlogfiles

timestamp1 <fs> <bytes_wr> \<bytes_rd> <op_open> <op_close> …

timestamp2 <fs> ……

1/min

1/min

1/min

LLview: adapterfor GPFS

mmpmon metrics

Scheduler (Slurm)

node / job info

LLview: adapterfor scheduler LML

…0.5-1/min

LLviewworkflow

LML LLview usage SQL database(sqlite3)

LML_usage moduleLLviewclientsLML

Project / User database(Oracle SQL)

On-/Off-lineanalysis

LLview jobreports<1/min

PDFsHTML5

Page 7: Job-based I/O-monitoring with LLviewAnalyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017 LLview GPFS I/O Monitoring & Statistics, W.Frings 12 Conclusion & Outlook I/O Workload

Analyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017 LLview GPFS I/O Monitoring & Statistics, W.Frings 7

LLview: Job reports Detailed reports on active and finished jobs Job summary and time-based diagrams Use cases: development, production

checks, job health check, light-weightperformance analysis

Metrics per Node/GPU: CPU load (5-min avg.) Interconnect (MB/s in/out, packets in/out) Memory usage (GiB) I/O (per FS: GiB, GiB/s write/read, #Open/Close) GPU (Usage, Memory, Power, Temperature, …)

Data Sources: SLURM, GPFS, IB, GPU Lists for active and finished jobs (< 2 weeks) Updates after 1-5 minutes Colormap for quick evaluation of metrics Link to PDF: detailed job report

Page 8: Job-based I/O-monitoring with LLviewAnalyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017 LLview GPFS I/O Monitoring & Statistics, W.Frings 12 Conclusion & Outlook I/O Workload

Analyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017 LLview GPFS I/O Monitoring & Statistics, W.Frings 8

GPFS mmpmon Data Workflow in LLview

compute nodeGPFS Client

compute nodeGPFS Client

compute nodeGPFS Client

mmpmonlogfiles

timestamp1 <fs> <bytes_wr> \<bytes_rd> <op_open> <op_close> …

timestamp2 <fs> ……

1/min

1/min

1/min

LLview: adapterfor GPFS

mmpmon metrics

Scheduler (Slurm)

node / job info

LLview: adapterfor scheduler LML

…0.5-1/min

LLviewworkflow

LML LLview usage SQL database(sqlite3)

LML_usage moduleLLviewclientsLML

Project / User database(Oracle SQL)

On-/Off-lineanalysis

LLview jobreportsstatistics<1/min

PDFsHTML5

Page 9: Job-based I/O-monitoring with LLviewAnalyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017 LLview GPFS I/O Monitoring & Statistics, W.Frings 12 Conclusion & Outlook I/O Workload

Analyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017 LLview GPFS I/O Monitoring & Statistics, W.Frings 9

I/O Workload: Jureca

read ≈ write

write >> read

read >> write

total # bytes read

total # bytes written

Page 10: Job-based I/O-monitoring with LLviewAnalyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017 LLview GPFS I/O Monitoring & Statistics, W.Frings 12 Conclusion & Outlook I/O Workload

Analyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017 LLview GPFS I/O Monitoring & Statistics, W.Frings 10

I/O Workload: Jureca by Topic (WORK)Particle Physics Climate, Earth, Environment CFD Plasma Physics

Misc. Engineering Computation Biology,Biophysics, Biochemistry

Material Science Astrophysics

Polymer/Soft Matter Physics Condensed Matter Physics ChemistryNuclear/Atomic Physics

Page 11: Job-based I/O-monitoring with LLviewAnalyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017 LLview GPFS I/O Monitoring & Statistics, W.Frings 12 Conclusion & Outlook I/O Workload

Analyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017 LLview GPFS I/O Monitoring & Statistics, W.Frings 11

I/O Workload: Jureca by Group

group 1 group 2 group 3

group 4 group 5 group 6 group 7

group 8 group 9 group 11group 10

Climate, Earth, EnvironmentClimate, Earth, Environment

Page 12: Job-based I/O-monitoring with LLviewAnalyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017 LLview GPFS I/O Monitoring & Statistics, W.Frings 12 Conclusion & Outlook I/O Workload

Analyzing Parallel I/O, BOF, SC17, Denver, Nov 16th 2017 LLview GPFS I/O Monitoring & Statistics, W.Frings 12

Conclusion & Outlook

I/O Workload statistics and analysis on different levelsConcepts for analysis of job-based mmpmon data;

On-line and off-line analysisClassification (e.g. Research topic, Group, Account)

User Access to I/O statistics I/O Reports access within user portalAutomatic job reports after job end

Next steps Automatic Identification of non-well behaving jobs Additional (derived) metrics

Temporal locality: Continuous, Burst, … Spatial locality: Parallel I/O schemes

Web interface: statistics, interactive graphs