Kernel-level Measurement for Integrated Parallel Performance Views

A. Nataraj, A. Malony, S. Shende, A. Morris
{anataraj,malony,sameer,amorris}@cs.uoregon.edu
Performance Research Lab, University of Oregon
Outline
Motivations
Objectives
ZeptoOS project
KTAU Architecture
Case Studies - the Performance Views
Perturbation Study
KTAU improvements
Future work
Acknowledgements
Motivation
Application performance is a consequence of both user-level execution and OS-level operation
Good tools exist for observing user-level performance: user-level events, communication events, execution time measures, hardware performance
Fewer tools exist to observe OS-level aspects; ideally we would like to do both simultaneously
OS-level influences on application performance
Scale and Performance Sensitivity
HPC systems continue to scale to larger processor counts; application performance becomes more performance sensitive
OS factors can lead to performance bottlenecks [Petrini'03, Jones'03, …]
System/application performance effects are complex; isolating system-level factors is non-trivial
Comprehensive performance understanding is required: observation of all performance factors, their relative contributions and interrelationships
Can we correlate OS and application performance?
Phase Performance Effects
Waiting time due to OS
Overhead accumulates
Program - OS Interactions

Direct: applications invoke the OS for certain services - syscalls and internal OS routines called from syscalls
Indirect: OS operations without explicit invocation by the application - preemptive scheduling, (HW) interrupt handling, OS background activity; can occur at any time
Program - OS Interactions (continued)
Direct interactions are easier to handle: synchronous with user code, in process context
Indirect interactions are more difficult: usually asynchronous, usually in interrupt context, harder to measure (where are the boundaries?), and harder to correlate and integrate with application measurements
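A minimal user-space sketch (illustrative, not from the talk) makes the distinction concrete: a direct interaction can be bracketed with user-level timers, but any indirect OS activity inside the bracket is absorbed without attribution.

/* Minimal sketch (not KTAU/TAU code): timing a direct OS interaction.
 * A synchronous syscall such as read() runs in process context, so a
 * user-level tool can bracket it with timers. Indirect interactions
 * (preemption, interrupt handling) can occur anywhere inside the same
 * interval and are invisible to this measurement. */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    struct timespec t0, t1;
    int fd = open("/dev/zero", O_RDONLY);
    if (fd < 0)
        return 1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    ssize_t n = read(fd, buf, sizeof(buf));      /* direct interaction */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L
            + (t1.tv_nsec - t0.tv_nsec);
    printf("read() of %zd bytes took %ld ns\n", n, ns);
    /* Any interrupt or preemption between t0 and t1 is silently
     * folded into ns, with no indication of its source. */
    close(fd);
    return 0;
}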
Performance Perspectives
Kernel-wide: aggregate kernel activity of all active processes. Understand overall OS behavior; identify and remove kernel hot spots. Cannot show application-specific OS actions
Process-centric: OS performance in a specific application context. Virtualization and mapping of performance to processes. Interactions among programs, daemons, and system services. Exposes sources of performance problems; tune the OS for a specific workload and the application for the OS
Existing Approaches
User-space-only measurement tools: many tools work only at user level and cannot observe system-level performance influences
Kernel-level-only measurement tools: most provide only the kernel-wide perspective and lack proper mapping/virtualization; some provide process-centric views but cannot integrate OS and user-level measurements
Existing Approaches (continued)
Combined or integrated user/kernel measurement tools: a few tools allow fine-grained measurement and can correlate kernel and user-level performance, but they typically focus only on direct OS interactions; indirect interactions are not normally merged, and parallel workloads (MPI ranks, OpenMP threads, …) are not explicitly recognized
We need an integrated approach to parallel performance observation and analysis that supports both perspectives
High-Level Objectives
Support low-overhead OS performance measurement at multiple levels of function and detail
Provide both kernel-wide and process-centric perspectives of OS performance
Merge user-level and kernel-level performance information across all program-OS interactions
Provide online information and the ability to function without a daemon where possible
Support both profiling and tracing for kernel-wide and process-centric views in parallel systems
Leverage existing parallel performance analysis tools, with support for observing, collecting, and analyzing parallel data
ZeptoOS
DOE OS/RTS for Extreme Scale Scientific Computation: effective OS/runtime for petascale systems; funded the ZeptoOS project
Argonne National Lab and University of Oregon
What are the fundamental limits and advanced designs required for petascale operating system suites? Behaviour at large scales; management and optimization of OS suites; collective operations; fault tolerance; OS performance analysis
ZeptoOS and TAU/KTAU
Lots of fine-grained OS measurement is required for each component of the ZeptoOS work
How and why do the various OS source and configuration changes affect parallel applications?
How do we correlate performance data between OS components, and between the parallel application and the OS?
Solution: TAU/KTAU, an integrated methodology and framework to measure the performance of applications and the OS kernel
ZeptoOS Strategy
“Small Linux on big computers” IBM BG/L and other systems (e.g., Cray XT3)
Argonne: modified Linux on BG/L I/O nodes (ION); modified Linux for BG/L compute nodes (TBD); specialized I/O daemon on the I/O node (ZOID) (TBD)
Oregon: KTAU - integration of TAU infrastructure in the Linux kernel; integration with ZeptoOS and installation on the BG/L ION; port to other 32-bit and 64-bit Linux platforms
KTAU Architecture
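As a rough illustration of the compiled-in kernel instrumentation this talk describes, the sketch below shows what a KTAU-style probe around a kernel routine could look like. All identifiers (ktau_prof_start, ktau_prof_stop, KTAU_SCHEDULE) are invented for illustration and are not the actual KTAU API.

/* Hypothetical kernel-side sketch of a compiled-in probe in the KTAU
 * style; the ktau_* names are invented, not the real KTAU API. */
struct ktau_timer {
    unsigned long long start;            /* cycle count at entry */
};

/* Per-process profile entry, one slot per instrumented kernel routine. */
struct ktau_prof_entry {
    unsigned long      calls;
    unsigned long long incl_time;        /* accumulated inclusive cycles */
};

static inline void ktau_prof_start(struct ktau_timer *t)
{
    t->start = get_cycles();             /* low-overhead timestamp */
}

static inline void ktau_prof_stop(struct ktau_timer *t,
                                  struct ktau_prof_entry *e)
{
    e->incl_time += get_cycles() - t->start;
    e->calls++;
}

/* Usage inside an instrumented routine, e.g. schedule():
 *
 *     struct ktau_timer t;
 *     ktau_prof_start(&t);
 *     ...original routine body...
 *     ktau_prof_stop(&t, &current->ktau_prof[KTAU_SCHEDULE]);
 *
 * Keeping the profile array on the task structure ("current") is what
 * yields the process-centric view; summing over all tasks yields the
 * kernel-wide view. */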
Controlled Experiments
Exercise the kernel in a controlled fashion; check whether KTAU produces the expected, correct, and meaningful views
Test machines:
Neutron: 4-CPU Intel P3 Xeon 550 MHz, 1 GB RAM, Linux 2.6.14.3 (ktau)
Neuronic: 16-node 2-CPU Intel P4 Xeon 2.8 GHz, 2 GB RAM/node, Red Hat Enterprise Linux 2.4 (ktau)
Benchmarks:
NPB LU and SP applications [NPB]: simulated computational fluid dynamics (CFD) applications; a regular-sparse, block lower and upper triangular system solution
LMBENCH [LMBENCH]: suite of micro-benchmarks exercising the Linux kernel; scenarios - interrupt load, scheduling load, exceptions
Controlled Experiments

Observing Interrupts
Benchmark: NPB-SP application on 16 nodes
User-level Inclusive Time / User-level Exclusive Time
Approx. 25 secs dilation in total inclusive time. Why?
Approx. 16 secs dilation in MPSP() exclusive time. Why?
The user-level profile does not tell the whole story!
Controlled Experiments
Observing Interrupts
User+OS Inclusive Time / User+OS Exclusive Time
Kernel-Space Time Taken by:
1. do_softirq()
2. schedule()
3. do_IRQ()
4. sys_poll()
5. icmp_rcv()
6. icmp_reply()
MPSP exclusive-time difference is only 4 secs. The exclusive-time view clearly identifies the culprits:
1. schedule()
2. do_IRQ()
3. icmp_reply()
4. do_softirq()
Pings cause interrupts (do_IRQ), which are then handled after the interrupt by soft-interrupts (do_softirq); the actual handling routine is icmp_reply/icmp_rcv.
The large number of softirqs causes ksoftirqd to be scheduled in, causing SP to be scheduled out.
Controlled Experiments
Observing Scheduling
NPB LU application on 8 CPUs (neuronic.nic.uoregon.edu)
Simulate daemon interference using a "toy" daemon that periodically wakes up and performs activity
What does KTAU show? Different views…
A: Aggregated Kernel-wide View (each row is a single host)
Controlled Experiments
Observing Scheduling
B: Process-level View (each row is a single process on host 8)
Select node 8 and take a closer look …
2 NPB LU processes
‘Toy’ Daemon activity
Controlled Experiments
Observing Scheduling
C: Voluntary / Involuntary Scheduling (each row is a single MPI rank)
Instrumentation to differentiate voluntary/involuntary scheduling; experiment re-run on a 4-processor SMP
Local slowdown: preemptive scheduling. Remote slowdown: voluntary scheduling (waiting!)
Pre-empted out by ‘Toy’ daemon
The other 3 yield the CPU voluntarily and wait!
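A minimal sketch of how such instrumentation can tell the two cases apart, mirroring the test Linux itself uses for its nvcsw/nivcsw context-switch counters (the ktau_* names are hypothetical): a task that is still runnable when schedule() switches it out was preempted, while a task that blocked yielded voluntarily.

/* Kernel-side sketch (hypothetical ktau_* names): classifying a
 * context switch inside schedule(), the same distinction Linux makes
 * for its nvcsw/nivcsw counters (TASK_RUNNING == 0). */
static void ktau_account_switch_out(struct task_struct *prev)
{
    if (prev->state == TASK_RUNNING) {
        /* Still runnable but switched out: preempted (involuntary),
         * e.g. to run the woken "toy" daemon or ksoftirqd. */
        ktau_start_timer(prev, KTAU_SCHED_PREEMPT);
    } else {
        /* Blocked, e.g. in MPI_Recv waiting on a slowed peer:
         * voluntary scheduling (waiting). */
        ktau_start_timer(prev, KTAU_SCHED_VOLUNTARY);
    }
}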
Controlled Experiments

Observing Exceptions
LMBENCH Page-Fault scenario
Call Group Relations / Program-OS Call Graph
Controlled Experiments

Tracing
Merging App/OS traces: MPI_Send and OS routines
Fine-grained tracing shows detail inside interrupts and bottom halves
Using VAMPIR trace visualization [VAMPIR]
Larger-Scale Runs

Run parallel benchmarks at larger scale (128 dual-CPU nodes)
Identify (and remove) system-level performance issues
Understand perturbation overheads introduced by KTAU
NPB benchmark: LU Application [NPB]
Simulated computational fluid dynamics (CFD) application. A regular-sparse, block lower and upper triangular system solution.
ASC benchmark: Sweep3D [Sweep3d]
Solves a 3-D, time-independent, neutron particle transport equation on an orthogonal mesh.
Test machine: Chiba-City Linux cluster (ANL), 128 dual-CPU Pentium III nodes, 450 MHz, 512 MB RAM/node, Linux 2.6.14.2 (ktau) kernel, connected by Ethernet
Larger-Scale Runs
Experienced problems on Chiba by chance
Initially ran NPB-LU and Sweep3D codes in the 128x1 configuration, then in the 64x2 configuration
Extreme performance hit (72% slower!) with the 64x2 runs
Used KTAU views to identify and solve issues iteratively
Eventually brought the performance gap down to 13% for LU and 9% for Sweep3D
Larger-scale Runs
Two ranks show relatively very low MPI_Recv() time.
The same two ranks' MPI_Recv() differs from the mean in OS-SCHED.
User-level MPI_Recv / MPI_Recv OS Interactions
64x2 Configuration
Larger-scale Runs
Two ranks have very low voluntary scheduling durations.
The (same) two ranks have very large preemptive scheduling.
Voluntary Scheduling / Preemptive Scheduling
Note: x-axis log scale
Larger-scale Runs
NPB LU processes PID 4066 and PID 4068 are active. No other significant activity!
Why the pre-emption?
64x2 Pinned: interrupt activity is bimodal across MPI ranks.
ccn10 node-level view: interrupt activity
Larger-scale Runs
Many more OS-TCP calls; approx. 100% longer
100% more background OS-TCP activity in the compute phase. More imbalance!
Use 'merged' performance data to identify overlap and imbalance. Why does a purely compute-bound region have lots of I/O?
TCP within Compute: Time / TCP within Compute: Calls
1. I/O-compute overlap - explicit MPI or OS buffering - TCP sends
2. Imbalance in workload - TCP recvs from on-time finishers
Larger-scale Runs
OS-TCP in SMP is costlier
IRQ balancing blindly distributes interrupts and bottom halves, e.g., handling a TCP-related bottom half (BH) on CPU-0 for an LU process on CPU-1
Cache issues! [COMSWARE]
Cost / Call of OS-level TCP
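One common remedy, consistent with the "pinned" runs shown earlier, is to override IRQ balancing and fix the NIC's interrupt line to a chosen CPU through the standard Linux /proc/irq/<N>/smp_affinity interface. A minimal sketch follows; the IRQ number 19 is a placeholder, not taken from the talk.

/* Pin an interrupt line to CPU 0 via Linux's smp_affinity interface so
 * the TCP bottom half runs in the cache of a chosen CPU. IRQ 19 is a
 * placeholder for the NIC's actual interrupt number. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/irq/19/smp_affinity", "w");  /* needs root */
    if (!f) {
        perror("smp_affinity");
        return 1;
    }
    fputs("1\n", f);     /* CPU bitmask: bit 0 set = CPU 0 only */
    fclose(f);
    return 0;
}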
Perturbation Study
Five different configurations:
Baseline: vanilla kernel, un-instrumented benchmark
Ktau-Off: kernel instrumentations compiled in but turned off (disabled probe effect)
Prof-All: all kernel instrumentations turned on
Prof-Sched: only the scheduler subsystem's instrumentations turned on
Prof-All+TAU: Prof-All, plus user-level TAU instrumentation enabled
NPB LU application benchmark: 16 nodes, 5 different configurations, mean over 5 runs each
ASC Sweep3D: 128 nodes, Base and Prof-All+TAU, mean over 5 runs each
Test machine: Chiba-City (ANL)
Perturbation Study
Disabled probe effect: a single instrumentation point is very cheap, e.g., scheduling.
Complete integrated profiling cost is under 3% on average, and as low as 1.58%.

Sweep3D on 128 nodes:
                 Base     Prof-All+TAU
Elapsed Time:    368.25   369.9
Avg. Slowdown:   -        0.49%
Future Work
Dynamic measurement control
Improve performance data sources
Improve integration with TAU's user-space capabilities: better correlation of user and kernel performance, full callpaths and phase-based profiling, merged user/kernel traces (already available)
Integration of TAU and KTAU with Supermon
Porting efforts to IA-64, PPC-64, and AMD Opteron
ZeptoOS characterization efforts: BG/L I/O node, dynamically adaptive kernels
Recent Work on ZeptoOS Project
Accurate identification of "noise" sources: the modified Linux on BG/L should be efficient
Effect of OS "noise" on synchronization/collectives
What OS aspects induce what types of interference: code paths, configurations, attached devices
Requires user-level and OS measurement; if we can identify noise sources, we can remove or alleviate the interference
Approach
ANL Selfish benchmark to identify "detours": noise events in user space
Shows durations and frequencies of events, but does NOT show cause or source
Runs a tight loop with an expected (ideal) duration and logs the times and durations of detours (see the sketch below)
Use KTAU OS-tracing to record OS activity and correlate times of occurrence (it uses the same time source as the Selfish benchmark) to infer the type of OS activity (if any) causing the "detour"
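A minimal user-space sketch of this tight-loop technique (illustrative, not ANL's actual Selfish code): read a shared clock in a loop, and whenever one iteration takes far longer than expected, record the gap as a detour to line up against the KTAU OS trace offline.

/* Selfish-style detour detector (illustrative, not the ANL code). */
#include <stdio.h>
#include <time.h>

static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);  /* clock shared with OS trace */
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
    const long long threshold_ns = 5000;  /* detour = iteration > 5 us */
    long long prev = now_ns();

    for (long i = 0; i < 100000000L; i++) {
        long long t = now_ns();
        if (t - prev > threshold_ns)
            /* A real harness would buffer records in memory; the log of
             * (start, duration) pairs is later correlated with the KTAU
             * kernel trace to infer the responsible OS activity. */
            printf("detour at %lld ns, lasted %lld ns\n", prev, t - prev);
        prev = t;
    }
    return 0;
}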
OS/User Performance View of Scheduling
preemptive scheduling
OS/User View of OS Background Activity
Other Applications of KTAU
Need to fill this in … (is there time)
Acknowledgements
Department of Energy’s Office of Science National Science Foundation University of Oregon (UO) Core Team
Aroon Nataraj, PhD Student Prof. Allen D Malony Dr. Sameer Shende, Senior Scientist Alan Morris, Senior Software Engineer Suravee Suthikulpanit , MS Student (Graduated)
Argonne National Lab (ANL) Contributors Pete Beckman Kamil Iskra Kazutomo Yoshii